In the last Lab we learned the basics of regular expressions. In this Lab we are going to learn the basics of text processing and put regular expressions into practice.
sed and awk are tools used by anyone who needs to work with text files.
awk, named after its developers (Aho, Weinberger and Kernighan), is a programming language for manipulating structured data.
sed, which stands for stream editor, is typically used to edit text files. It allows you to edit files without opening a text editor and, most importantly, to edit many files in a batch.
sed accepts input in the form of text lines. During its execution, it goes through its input one line at a time, applies one or more commands to each line individually, and then writes the processed lines to standard output.
If you want to run sed on the contents of a file, pass the file name as an argument, as shown in the following example, in which we replace the word “quick” with “fast”:
[you@blue ~]$ cat << EOF > somefile.txt
> The quick brown fox
> jumped over the lazy dog.
> EOF
[you@blue ~]$ sed 's/quick/fast/' somefile.txt
The fast brown fox
jumped over the lazy dog.
Notice that when you run this command, the output of sed is written to stdout and the contents of somefile.txt are not modified.
(To actually modify the contents of somefile.txt, you can redirect the output to a new file, or you can use the -i option to edit the file in place.)
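The two approaches mentioned above can be sketched as follows. This is only a sketch that works on a scratch copy created with mktemp; the names workdir and newfile.txt are placeholders chosen for illustration. It assumes a sed that supports -i (GNU or BSD); the -i.bak form, which keeps a backup copy with the given suffix, is used here because it is accepted by both.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
printf 'The quick brown fox\njumped over the lazy dog.\n' > "$workdir/somefile.txt"

# Option 1: redirect the output to a new file. Do not redirect to the
# input file itself: the shell would truncate it before sed reads it.
sed 's/quick/fast/' "$workdir/somefile.txt" > "$workdir/newfile.txt"

# Option 2: edit in place, keeping the original as somefile.txt.bak.
sed -i.bak 's/quick/fast/' "$workdir/somefile.txt"

first_line=$(head -n 1 "$workdir/somefile.txt")
backup_line=$(head -n 1 "$workdir/somefile.txt.bak")
echo "$first_line"    # The fast brown fox
echo "$backup_line"   # The quick brown fox
```

Keeping a backup is a cautious default while you are still testing a substitution, since an in-place edit cannot be undone otherwise.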
If you want sed to take its input from the output of another command, you typically invoke it in this form:
[you@blue ~]$ echo -e "The quick brown fox\njumped over the lazy dog." | sed 's/quick/fast/'
The fast brown fox
jumped over the lazy dog.
sed supports many different types of commands.
The commands that we executed previously are examples of the substitute command, which is the most commonly used sed command.
The elements of a sed command are typically separated by a “slash” (/) character; however, you can use any character as the delimiter except a backslash or a newline character.
In our examples, the substitute command is indicated by the s.
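A different delimiter is most useful when the pattern itself contains slashes, as file paths do. The following sketch performs the same substitution twice, once with | as the delimiter and once with the default slash; the path is made up for illustration.

```shell
# Using | as the delimiter avoids escaping every slash in the path.
with_pipe=$(echo '/usr/local/bin' | sed 's|/usr/local|/opt|')

# The same substitution with the default delimiter needs backslashes.
with_slash=$(echo '/usr/local/bin' | sed 's/\/usr\/local/\/opt/')

echo "$with_pipe"    # /opt/bin
echo "$with_slash"   # /opt/bin
```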
The substitute command has the following syntax:
[address]s/pattern/replacement/flags
When using the substitute command, the second token (quick) is the regular expression to match and the third token (fast) is the replacement text. The substitute command also accepts a fourth token, omitted in our examples, which is a flag that can be:
- n: a number (between 1 and 512) indicating that the replacement should be made only for the nth occurrence of the pattern on each line
- g: make the change globally, on all occurrences of the pattern
- p: print the pattern space (only when a substitution was made)

If we want to modify a specific line in the text input, we can do so by using the address element of the command. Addresses can be specified in different ways:
Notation | Description |
---|---|
n | The command will be applied only to line number n |
$ | The last line |
n1,n2 | A range of lines from n1 to n2 |
n~y | Line number n, then every y-th line after it (GNU sed) |
n1,+n2 | Line n1 and the following n2 lines |
n! | All lines except line n |
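Notice that a sed command only takes effect on lines that are selected by the address and where the supplied pattern has a match. In the following example nothing is replaced, because the address selects line 2 while “quick” only appears on line 1: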
[you@blue ~]$ echo -e "The quick brown fox\njumped over the lazy dog." | sed '2s/quick/fast/'
The quick brown fox
jumped over the lazy dog.
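The effect of the flags and of some address forms can be checked with a small experiment. The sketch below uses a made-up three-line input held in a variable named text (a name chosen for this illustration) and stores each result in its own variable so that they can be compared.

```shell
text='one one one
two one two
one two one'

# No flag: only the first occurrence on each line is replaced.
first_only=$(printf '%s\n' "$text" | sed 's/one/1/')
# g flag: every occurrence on each line is replaced.
all=$(printf '%s\n' "$text" | sed 's/one/1/g')
# Numeric flag: only the 2nd occurrence on each line is replaced.
second=$(printf '%s\n' "$text" | sed 's/one/1/2')
# Address 2: the substitution is attempted on line 2 only.
line2=$(printf '%s\n' "$text" | sed '2s/one/1/')
# Address $: the substitution is attempted on the last line only.
last=$(printf '%s\n' "$text" | sed '$s/one/1/g')
# Range 1,2: the substitution is attempted on lines 1 and 2.
range=$(printf '%s\n' "$text" | sed '1,2s/two/2/')

printf '%s\n' "$second"
# one 1 one
# two one two
# one two 1
```

Line 2 of the printed result is unchanged because it contains only one occurrence of the pattern, so there is no second occurrence to replace.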
The following table summarizes other commands available with sed:
Command | Syntax | Description |
---|---|---|
p | [address]/regexp/p | Prints lines that match the address or pattern |
d | [address]/regexp/d | Deletes lines that match the address or pattern |
The print command works very similarly to the p flag of the substitute command, except that it does not require you to actually perform a substitution.
Note
Controlling sed output
Note that by default sed outputs every line. If you want fine control over the output, you must use the -n flag (quiet mode), which causes sed to output only the lines that are explicitly printed by a print command or by a command that has the p flag.
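The interaction between -n, the print command, the p flag and the delete command can be seen in a short sketch. The input is the two-line text used in the earlier examples; the variable names are chosen for this illustration only.

```shell
text='The quick brown fox
jumped over the lazy dog.'

# Without -n, the p command prints matching lines a second time.
doubled=$(printf '%s\n' "$text" | sed '/quick/p')
# With -n, only the lines explicitly printed by p are output.
matching=$(printf '%s\n' "$text" | sed -n '/quick/p')
# The p flag of s prints only lines where a substitution happened.
changed=$(printf '%s\n' "$text" | sed -n 's/lazy/sleepy/p')
# The d command deletes the lines selected by the address.
remaining=$(printf '%s\n' "$text" | sed '/quick/d')

echo "$matching"    # The quick brown fox
echo "$changed"     # jumped over the sleepy dog.
echo "$remaining"   # jumped over the lazy dog.
```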
So far, we have seen examples of sed where only one command is executed. sed can run multiple commands by using the -e option:
[you@blue ~]$ echo -e "The quick brown fox\njumped over the lazy dog." | sed -e 's/quick/fast/' -e 's/brown/red/'
The fast red fox
jumped over the lazy dog.
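Two details are worth trying out here: commands can also be separated by a semicolon inside a single script, and the commands are applied to each line in the order given, so a later command sees the result of an earlier one. A minimal sketch, with variable names chosen for illustration:

```shell
text='The quick brown fox'

# Two equivalent ways of running two substitute commands.
with_e=$(echo "$text" | sed -e 's/quick/fast/' -e 's/brown/red/')
with_semicolon=$(echo "$text" | sed 's/quick/fast/; s/brown/red/')

# Order matters: the second command acts on the output of the first.
chained=$(echo "$text" | sed -e 's/quick/fast/' -e 's/fast/slow/')

echo "$with_e"           # The fast red fox
echo "$with_semicolon"   # The fast red fox
echo "$chained"          # The slow brown fox
```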
cut is used to extract a section of text from each line of its input and output the extracted section.
Please refer to the man pages of cut to get familiar with the different options it supports.
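As a starting point, the sketch below shows three commonly used options: -d to set the field delimiter, -f to select fields, and -c to select character positions. The input line is a made-up entry in the style of /etc/passwd, used only for illustration.

```shell
line='alice:x:1001:1001:Alice Example:/home/alice:/bin/bash'

# -d sets the field delimiter, -f selects fields by number.
user=$(echo "$line" | cut -d: -f1)
# Several fields can be selected; the output keeps the input delimiter.
user_shell=$(echo "$line" | cut -d: -f1,7)
# -c selects by character position instead of by field.
prefix=$(echo "$line" | cut -c1-5)

echo "$user"         # alice
echo "$user_shell"   # alice:/bin/bash
echo "$prefix"       # alice
```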
We have seen the sort command in previous labs.
Its purpose is to sort the contents of standard input or one or more files.
The default sort order is ascending by character value.
Consult the man pages to see the different options it supports.
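The difference between the default character-based order and a numeric sort is easy to miss, so here is a small sketch. It sets LC_ALL=C for the first command as an assumption, to make the character-by-character comparison reproducible regardless of the locale configured on your machine.

```shell
numbers='10
9
100
1'

# Default: compared as text, character by character.
as_text=$(printf '%s\n' "$numbers" | LC_ALL=C sort)
# -n: compared by numeric value.
as_numbers=$(printf '%s\n' "$numbers" | sort -n)
# -r reverses the order (here combined with -n).
descending=$(printf '%s\n' "$numbers" | sort -nr)

printf '%s\n' "$as_text"
# 1
# 10
# 100
# 9
```

In the text ordering, 9 comes last because the character 9 sorts after the character 1, regardless of how many digits follow.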
We have also seen the uniq command in previous labs.
uniq removes duplicated entries from a file.
Note that uniq does not sort its input; it detects a duplicate only if the previous line is identical to the line currently being processed.
For uniq to remove all duplicates, you will most likely need to process the data with sort beforehand.
Also note that the GNU version of sort has the -u option to remove duplicates, which renders a subsequent call to uniq unnecessary.
However, uniq has some interesting options that, instead of removing duplicates, actually print them.
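The behaviour described above can be verified with a short sketch. The list of fruits is made-up sample data; -d (print only repeated lines) and -c (prefix each line with its count) are two of the options alluded to above.

```shell
fruits='apple
banana
apple
apple
cherry'

# Only adjacent duplicates are collapsed: the two consecutive "apple"
# lines are merged, but the first "apple" remains a separate entry.
unsorted=$(printf '%s\n' "$fruits" | uniq)
# Sorting first makes all duplicates adjacent.
sorted_unique=$(printf '%s\n' "$fruits" | sort | uniq)
# sort -u gives the same result in a single step.
same_with_sort=$(printf '%s\n' "$fruits" | sort -u)
# -d prints one copy of each line that is repeated.
only_dups=$(printf '%s\n' "$fruits" | sort | uniq -d)
# -c prefixes each line with the number of times it occurred.
counts=$(printf '%s\n' "$fruits" | sort | uniq -c)

printf '%s\n' "$unsorted"
# apple
# banana
# apple
# cherry
echo "$only_dups"   # apple
```

The exact width of the padding that uniq -c puts in front of each count varies between implementations, so scripts that parse it should not depend on the column position.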
awk is much more than a tool; it is a programming language in its own right. In this class we are going to learn only its most basic usage, through some very basic examples.
In the following example we extract data fields from a string.
awk uses whitespace characters as its default field delimiter. The awk field variables start at $1 and increment up through the end of the line. In the example, there are 9 fields.
The variable $0 corresponds to the entire line, and the variable NF contains the number of fields in the current line.
[you@blue ~]$ echo 'The quick brown fox jumped over the lazy dog' | awk '{ print $1, $9, $5, $6, $7, $4}'
The dog jumped over the fox
Notice how you can print the last field by using $NF, that is, the field whose number is NF:
[you@blue ~]$ echo 'The quick brown fox jumped over the lazy dog' | awk '{ print "The input was: "$0". The last field is "$NF}'
The input was: The quick brown fox jumped over the lazy dog. The last field is dog
If the data is not formatted using spaces as separators, we can specify the delimiter by using the -F option.
[you@blue ~]$ echo 'this-is-a-hyphen-delimited-text' | awk -F'-' '{ print $4, $5}'
hyphen delimited
A common need is to be able to pass variables from bash to awk.
The following example shows how to achieve this (the -v option is used to assign a value to an awk variable). Since myvar holds the value 3, $myvar refers to the third field:
[you@blue ~]$ VAR=3
[you@blue ~]$ echo "The quick brown fox" | awk -v myvar="$VAR" '{print $myvar}'
brown
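Inside awk, a variable set with -v is written without a dollar sign; putting a dollar sign in front of it means "the field whose number is stored in the variable". The sketch below contrasts the two uses. The names LABEL, FIELD, label and n are chosen for this illustration only.

```shell
LABEL="first word"
FIELD=1

# label is used as a plain value; $n selects the field numbered n.
out=$(echo "The quick brown fox" |
  awk -v label="$LABEL" -v n="$FIELD" '{ print label ": " $n }')

echo "$out"   # first word: The
```

Quoting the shell variables ("$LABEL") keeps values that contain spaces together as a single argument to -v.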
You can also perform substring extraction with awk through its substr function. The syntax of this function is
substr(string, position of first character, length of substring)
The position of the first character is 1-indexed (that is, the first character has a position equal to one, not zero).
[you@blue ~]$ echo -e "This is linenumberone\nThis is linenumbertwo" | awk '{print substr($3,11,3)}'
one
two
In the previous examples awk was reading input from stdin. It can also read directly from files.
[you@blue ~]$ cat << EOF > datafile
> The brown fox 1234
> That quick dog 2131
> This lazy parrot 7988
> The slow snail 2344
> EOF
[you@blue ~]$ awk '{print $3}' datafile
fox
dog
parrot
snail
In the next example we produce output only if the first field matches a certain value:
[you@blue ~]$ awk '$1 == "The" {print $3}' datafile
fox
snail
We can also apply regular expressions by using the ~ operator:
[you@blue ~]$ awk '$1 ~ /^Th[ai]/ {print $3}' datafile
dog
parrot
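The condition in front of the braces can be any awk expression. As a sketch of a few variations on the examples above, the following uses the same data with the negated match operator (!~), a numeric comparison, and two conditions joined by &&. The data is written to a temporary file created with mktemp so that the sketch is self-contained; the variable names are chosen for illustration.

```shell
datafile=$(mktemp)
cat << EOF > "$datafile"
The brown fox 1234
That quick dog 2131
This lazy parrot 7988
The slow snail 2344
EOF

# !~ selects lines whose first field does NOT match the expression.
not_matching=$(awk '$1 !~ /^Th[ai]/ {print $3}' "$datafile")
# Fields that look like numbers can be compared numerically.
large=$(awk '$4 > 2200 {print $3, $4}' "$datafile")
# Conditions can be combined with && (and) or || (or).
both=$(awk '$1 == "The" && $4 > 2000 {print $3}' "$datafile")

rm -f "$datafile"

printf '%s\n' "$large"
# parrot 7988
# snail 2344
echo "$both"   # snail
```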