These two posters show the difference between professor Edsger Dijkstra and the author of Perl programming language, Larry Wall.

Edsger Dijkstra, Quick and Dirty
Edsger Dijkstra - Quick and Dirty, I Wouldn't Like It.

Larry Wall, Quick and Dirty
Larry Wall - Quick and Dirty, I'd Hit It.

See also what John McCarthy would say about your code.

Edsger DijkstraI found a great video interview with Edsger Wybe Dijkstra. You have probably heard of Dijkstra's algorithm. He invented it.

In the interview professor Edsger talks about his thoughts on software development. He compares two very different styles of programming - Mozart style of programming vs. Beethoven style of programming. When Mozart started to write, the composition was finished. He wrote manuscript in elegant handwriting in one go. Beethoven was a doubter and a struggler. He started writing before he finished the composition and then glued corrections onto the page. In one place he did it nine times. When they peeled them, the last version proved to be identical to the first one.

From the video one can understand that Edsger preferred Mozart's style of programming. Not just programming, but Mozart style of doing things. He says that the most important thing has been the daily discipline of neatly writing down his thoughts.

His daily discipline lead to hundreds of crystal clear scientific papers, which have now been archived in EWD Archive.

You are welcome to watch interview with Edsger Dijkstra:

At the beginning of video Dijkstra criticizes current software release methodology. He says that version 1.0 of should be the finished product. I don't think he's right. It's like Tannenbaum saying Torvalds that Linux is obsolete. Also see "Release Early, Release Often."

Edsger Dijkstra's quotes from video:

  • Computer science is no more about computers than astronomy is about telescopes.
  • The competent programmer is fully aware of the limited size of his own skull. He therefore approaches his task in full humility and avoids clever tricks like the plague.
  • We should not introduce errors through sloppiness but systematically keep them out.
  • Program testing can convincingly show the presence of bugs but it is hopelessly inadequate to show their absence.
  • Elegance is not a dispensable luxury but a factor that decides between success and failure.

I found a funny poster of Dijkstra:

Dijkstra - Quick and Dirty

Awk Tips and TricksHi guys and girls, this is the first guest post on my blog. It's written by Waldner from #awk on FreeNode IRC Network. He works as a sysadmin and does shell scripting as a hobby. Waldner will be happy to take any questions about the article. You can ask them in the comments of this post or on IRC.

This article takes a look at ten tips, tricks and pitfalls in Awk programming language. They are mostly taken from the discussions in #awk IRC channel. Here they are:

Be idiomatic!

Update: Mr. Waldner just notified me that he has improved the tips on being idiomatic. See "Idiomatic Awk" on his website!

In this paragraph, we give some hints on how to write more idiomatic (and usually shorter and more efficient) awk programs. Many awk programs you're likely to encounter, especially short ones, make large use of these notions.

Suppose one wants to print all the lines in a file that match some pattern (a kind of awk-grep, if you like). A reasonable first shot is usually something like

awk '{if ($0 ~ /pattern/) print $0}'

That works, but there are a number of things to note.

The first thing to note is that it is not structured according to the awk's definition of a program, which is

condition { actions }

Our program can clearly be rewritten using this form, since both the condition and the action are very clear here:

awk '$0 ~ /pattern/ {print $0}'

Our next step in the perfect awk-ification of this program is to note that /pattern/ is the same as $0 ~ /pattern/. That is, when awk sees a single regular expression used as an expression, it implicitly applies it to $0, and returns success if there is a match. Then we have:

awk '/pattern/ {print $0}'

Now, let's turn our attention to the action part (what's inside braces). print $0 is a redundant statement, since print alone, by default, prints $0.

awk '/pattern/ {print}'

But now we note that, when it finds that a condition is true, and there are no associated actions, awk performs a default action that is (you guessed it) print (which we already know is equivalent to print $0). Thus we can do this:

awk '/pattern/'

Now we have reduced the initial program to its simplest (and more idiomatic) form. In many cases, if all you want to do is print some lines, according to a condition, you can write awk programs composed only of a condition (although complex):

awk '(NR%2 && /pattern/) || (!(NR%2) && /anotherpattern/)'

That prints odd lines that match /pattern/, or even lines that match /anotherpattern/. Naturally, if you don't want to print $0 but instead do something else, then you'll have to manually add a specific action to do what you want.

From the above, it follows that

awk 1
awk '"a"'   # single quotes are important!

are both awk programs that just print their input unchanged. Sometimes, you want to operate only on some lines of the input (according to some condition), but also want to print all the lines, regardless of whether they were affected by your operation or not. A typical example is a program like this:

awk '{sub(/pattern/,"foobar")}1'

This tries to replace "pattern" with "foobar". Whether or not the substitution succeeds, the always-true condition "1" prints each line (you could even use "42", or "19", or any other nonzero value if you want; "1" is just what people traditionally use). This results in a program that does the same job as sed 's/pattern/foobar/'. Here are some examples of typical awk idioms, using only conditions:

awk 'NR % 6'            # prints all lines except those divisible by 6
awk 'NR > 5'            # prints from line 6 onwards (like tail -n +6, or sed '1,5d')
awk '$2 == "foo"'       # prints lines where the second field is "foo"
awk 'NF >= 6'           # prints lines with 6 or more fields
awk '/foo/ && /bar/'    # prints lines that match /foo/ and /bar/, in any order
awk '/foo/ && !/bar/'   # prints lines that match /foo/ but not /bar/
awk '/foo/ || /bar/'    # prints lines that match /foo/ or /bar/ (like grep -e 'foo' -e 'bar')
awk '/foo/,/bar/'       # prints from line matching /foo/ to line matching /bar/, inclusive
awk 'NF'                # prints only nonempty lines (or: removes empty lines, where NF==0)
awk 'NF--'              # removes last field and prints the line
awk '$0 = NR" "$0'      # prepends line numbers (assignments are valid in conditions)

Another construct that is often used in awk is as follows:

awk 'NR==FNR { # some actions; next} # other condition {# other actions}' file1 file2

This is used when processing two files. When processing more than one file, awk reads each file sequentially, one after another, in the order they are specified on the command line. The special variable NR stores the total number of input records read so far, regardless of how many files have been read. The value of NR starts at 1 and always increases until the program terminates. Another variable, FNR, stores the number of records read from the current file being processed. The value of FNR starts from 1, increases until the end of the current file, starts again from 1 as soon as the first line of the next file is read, and so on. So, the condition "NR==FNR" is only true while awk is reading the first file. Thus, in the program above, the actions indicated by "# some actions" are executed when awk is reading the first file; the actions indicated by "# other actions" are executed when awk is reading the second file, if the condition in "# other condition" is met. The "next" at the end of the first action block is needed to prevent the condition in "# other condition" from being evaluated, and the actions in "# other actions" from being executed while awk is reading the first file.

There are really many problems that involve two files that can be solved using this technique. Here are some examples:

# prints lines that are both in file1 and file2 (intersection)
awk 'NR==FNR{a[$0];next} $0 in a' file1 file2

Here we see another typical idiom: a[$0] has the only purpose of creating the array element indexed by $0. During the pass over the first file, all the lines seen are remembered as indexes of the array a. The pass over the second file just has to check whether each line being read exists as an index in the array a (that's what the condition $0 in a does). If the condition is true, the line is printed (as we already know).

Another example. Suppose we have a data file like this

20081010 1123 xxx
20081011 1234 def
20081012 0933 xyz
20081013 0512 abc
20081013 0717 def
...thousand of lines...

where "xxx", "def", etc. are operation codes. We want to replace each operation code with its description. We have another file that maps operation codes to human readable descriptions, like this:

abc withdrawal
def payment
xyz deposit
xxx balance
...other codes...

We can easily replace the opcodes in the data file with this simple awk program, that again uses the two-files idiom:

# use information from a map file to modify a data file
awk 'NR==FNR{a[$1]=$2;next} {$3=a[$3]}1' mapfile datafile

First, the array a, indexed by opcode, is populated with the human readable descriptions. Then, it is used during the reading of the second file to do the replacements. Each line of the datafile is then printed after the substitution has been made.

Another case where the two-files idiom is useful is when you have to read the same file twice, the first time to get some information that can be correctly defined only by reading the whole file, and the second time to process the file using that information. For example, you want to replace each number in a list of numbers with its difference from the largest number in the list:

# replace each number with its difference from the maximum
awk 'NR==FNR{if($0>max) max=$0;next} {$0=max-$0}1' file file

Note that we specify "file file" on the command line, so the file will be read twice.

Caveat: all the programs that use the two-files idiom will not work correctly if the first file is empty (in that case, awk will execute the actions associated to NR==FNR while reading the second file). To correct that, you can reinforce the NR==FNR condition by adding a test that checks that also FILENAME equals ARGV[1].

Pitfall: shorten pipelines

It's not uncommon to see lines in scripts that look like this:

somecommand | head -n +1 | grep foo | sed 's/foo/bar/' | tr '[a-z]' '[A-Z]' | cut -d ' ' -f 2

This is just an example. In many cases, you can use awk to replace parts of the pipeline, or even all of it:

somecommand | awk 'NR>1 && /foo/{sub(/foo/,"bar"); print toupper($2)}'

It would be nice to collect here many examples of pipelines that could be partially or completely eliminated using awk.

Print lines using ranges

Yes, we all know that awk has builtin support for range expressions, like

# prints lines from /beginpat/ to /endpat/, inclusive
awk '/beginpat/,/endpat/'

Sometimes however, we need a bit more flexibility. We might want to print lines between two patterns, but excluding the patterns themselves. Or only including one. A way is to use these:

# prints lines from /beginpat/ to /endpat/, not inclusive
awk '/beginpat/,/endpat/{if (!/beginpat/&&!/endpat/)print}'

# prints lines from /beginpat/ to /endpat/, not including /beginpat/
awk '/beginpat/,/endpat/{if (!/beginpat/)print}'

It's easy to see that there must be a better way to do that, and in fact there is. We can use a flag to keep track of whether we are currently inside the interesting range or not, and print lines based on the value of the flag. Let's see how it's done:

# prints lines from /beginpat/ to /endpat/, not inclusive
awk '/endpat/{p=0};p;/beginpat/{p=1}'

# prints lines from /beginpat/ to /endpat/, excluding /endpat/
awk '/endpat/{p=0} /beginpat/{p=1} p'

# prints lines from /beginpat/ to /endpat/, excluding /beginpat/
awk 'p; /endpat/{p=0} /beginpat/{p=1}'

All these programs just set p to 1 when /beginpat/ is seen, and set p to 0 when /endpat/ is seen. The crucial difference between them is where the bare "p" (the condition that triggers the printing of lines) is located. Depending on its position (at the beginning, in the middle, or at the end), different parts of the desired range are printed. To print the complete range (inclusive), you can just use the regular /beginpat/,/endpat/ expression or use the flag technique, but reversing the order of the conditions and associated patterns:

# prints lines from /beginpat/ to /endpat/, inclusive
awk '/beginpat/{p=1};p;/endpat/{p=0}'

It goes without saying that while we are only printing lines here, the important thing is that we have a way of selecting lines within a range, so you can of course do anything you want instead of printing.

Split file on patterns

Suppose we have a file like this

line1
line2
line3
line4
FOO1
line5
line6
FOO2
line7
line8
FOO3
line9
line10
line11
FOO4
line12
FOO5
line13

We want to split this file on all the occurrences of lines that match /^FOO/, and create a series of files called, for example, out1, out2, etc. File out1 will contain the first 4 lines, out2 will contain "line5" and "line6", etc. There are at least two ways to do that with awk:

# first way, works with all versions of awk
awk -v n=1 '/^FOO[0-9]*/{close("out"n);n++;next} {print > "out"n}' file

Since we don't want to print anything when we see /^FOO/, but only update some administrative data, we use the "next" statement to tell awk to immediately start processing the next record. Lines that do not match /^FOO/ will instead be processed by the second block of code. Note that this method will not create empty files if an empty section is found (eg, if "FOO5\nFOO6" is found, the file "out5" will not be created). The "-v n=1" is used to tell awk that the variable "n" should be initialized with a value of 1, so effectively the first output file will be called "out1".

Another way (which however needs GNU awk to work) is to read one chunk of data at a time, and write that to its corresponding out file.

# another way, needs GNU awk
LC_ALL=C gawk -v RS='FOO[0-9]*\n' -v ORS= '{print > "out"NR}' file

The above code relies on the fact that GNU awk supports assigning a regular expression to RS (the standard only allows a single literal character or an empty RS). That way, awk reads a series of "records", separated by the regular expression matching /FOO[0-9]*\n/ (that is, the whole FOO... line). Since newlines are preserved in each section, we set ORS to empty since we don't want awk to add another newline at the end of a block. This method does create an empty file if an empty section is encountered. On the downside, it's a bit fragile because it will produce incorrect results if the regex used as RS appears somewhere else in the rest of the input.

We will see other examples where gawk's support for regexes as RS is useful. Note that the last program used LC_ALL=C at the beginning...

Locale-based pitfalls

Sometimes awk can behave in an unexpected way if the locale is not C (or POSIX, which should be the same). See for example this input:

-rw-r--r-- 1 waldner users 46592 2003-09-12 09:41 file1
-rw-r--r-- 1 waldner users 11509 2008-10-07 17:42 file2
-rw-r--r-- 1 waldner users 11193 2008-10-07 17:41 file3
-rw-r--r-- 1 waldner users 19073 2008-10-07 17:45 file4
-rw-r--r-- 1 waldner users 36332 2008-10-07 17:03 file5
-rw-r--r-- 1 waldner users 33395 2008-10-07 16:53 file6
-rw-r--r-- 1 waldner users 54272 2008-09-18 16:20 file7
-rw-r--r-- 1 waldner users 20573 2008-10-07 17:50 file8

You'll recognize the familiar output of ls -l here. Let's use a non-C locale, say, en_US.utf8, and try an apparently innocuous operation like removing the first 3 fields.

$ LC_ALL=en_US.utf8 awk --re-interval '{sub(/^([^[:space:]]+[[:space:]]+){3}/,"")}1' file
-rw-r--r-- 1 waldner users 46592 2003-09-12 09:41 file1
-rw-r--r-- 1 waldner users 11509 2008-10-07 17:42 file2
-rw-r--r-- 1 waldner users 11193 2008-10-07 17:41 file3
-rw-r--r-- 1 waldner users 19073 2008-10-07 17:45 file4
-rw-r--r-- 1 waldner users 36332 2008-10-07 17:03 file5
-rw-r--r-- 1 waldner users 33395 2008-10-07 16:53 file6
-rw-r--r-- 1 waldner users 54272 2008-09-18 16:20 file7
-rw-r--r-- 1 waldner users 20573 2008-10-07 17:50 file8

It looks like sub() did nothing. Now change that to use the C locale:

$ LC_ALL=C awk --re-interval '{sub(/^([^[:space:]]+[[:space:]]+){3}/,"")}1' file
users 46592 2003-09-12 09:41 file1
users 11509 2008-10-07 17:42 file2
users 11193 2008-10-07 17:41 file3
users 19073 2008-10-07 17:45 file4
users 36332 2008-10-07 17:03 file5
users 33395 2008-10-07 16:53 file6
users 54272 2008-09-18 16:20 file7
users 20573 2008-10-07 17:50 file8

Now it works. Another localization issue is the behavior of bracket expressions matching, like for example [a-z]:

$ echo 'èòàù' | LC_ALL=en_US.utf8 awk '/[a-z]/'
èòàù

This may or may not be what you want. When in doubt or when facing an apparently inexplicable result, try putting LC_ALL=C before your awk invocation.

Parse CSV

Update: Mr. Waldner just notified me that he has improved this section of the article on his website. See "CVS Parsing With Awk".

This is another thing people do all the time with awk. Simple CSV files (with fields separated by commas, and commas cannot appear anywhere else) are easily parsed using FS=','. There can be spaces around fields, and we don't want them, like eg

    field1  ,   field2   , field3   , field4

Exploiting the fact that FS can be a regex, we could try something like FS='^ *| *, *| *$'. This can be problematic for two reasons:

  • actual data field might end up correponding either to awk's fields 1 ... NF or 2 ... NF, depending on whether the line has leading spaces or not;
  • for some reason, assigning that regex to FS produces unexpected results if fields have embedded spaces (anybody knows why?).

In this case, it's probably better to parse using FS=',' and remove leading and trailing spaces from each field:

# FS=','
for(i=1;i<=NF;i++){
  gsub(/^ *| *$/,"",$i);
  print "Field " i " is " $i;
}

Another common CSV format is

"field1","field2","field3","field4"

Assuming double quotes cannot occur in fields. This is easily parsed using FS='^"|","|"$' (or FS='","|"' if you like), keeping in mind that the actual fields will be in position 2, 3 ... NF-1. We can extend that FS to allow for spaces around fields, like eg

   "field1"  , "field2",   "field3" , "field4"

by using FS='^ *"|" *, *"|" *$'. Usable fields will still be in positions 2 ... NF-1. Unlike the previous case, here that FS regex seems to work fine. You can of course also use FS=',', and remove extra characters by hand:

# FS=','
for(i=1;i<=NF;i++){
  gsub(/^ *"|" *$/,"",$i);
  print "Field " i " is " $i;
}

Another CSV format is similar to the first CSV format above, but allows for field to contain commas, provided that the field is quoted:

 field1, "field2,with,commas"  ,  field3  ,  "field4,foo"

We have a mixture of quoted and unquoted fields here, which cannot parsed directly by any value of FS (that I know of, at least). However, we can still get the fields using match() in a loop (and cheating a bit):

$0=$0",";                                  # yes, cheating
while($0) {
  match($0,/[^,]*,| *"[^"]*" *,/);            
  sf=f=substr($0,RSTART,RLENGTH);          # save what matched in sf
  gsub(/^ *"?|"? *,$/,"",f);               # remove extra stuff
  print "Field " ++c " is " f;
  sub(sf,"");                              # "consume" what matched
}

As the complexity of the format increases (for example when escaped quotes are allowed in fields), awk solutions become more fragile. Although I should not say this here, for anything more complex than the last example, I suggest using other tools (eg, Perl just to name one). Btw, it looks like there is an awk CSV parsing library here: http://lorance.freeshell.org/csv/ (I have not tried it).

Pitfall: validate an IPv4 address

Let's say we want to check whether a given string is a valid IPv4 address (for simplicity, we limit our discussion to IPv4 addresses in the traditiona dotted quad format here). We start with this seemingly valid program:

awk -F '[.]' 'function ok(n){return (n>=0 && n<=255)} {exit (ok($1) && ok($2) && ok($3) && ok($4))}'

This seems to work, until we pass it '123b.44.22c.3', which it happily accepts as valid. The fact is that, due to the way awk's number to string conversion works, some strings may "look like" numbers to awk, even if we know they are not. The correct thing to do here is to perform a string comparison against a regular expression:

awk -F '[.]' 'function ok(n) {
  return (n ~ /^([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])$/)
}
{exit (ok($1) && ok($2) && ok($3) && ok($4))}'

Check whether two files contain the same data

We want to check whether two (unsorted) files contain the same data, that is, the set of lines of the first file is the same set of lines of the second file. One way is of course sorting the two files and processing them with some other tool (for example, uniq or diff). But we want to avoid the relatively expensive sort operation. Can awk help us here? The answer (you guessed it) is yes. If we know that the two files do not contain duplicates, we can do this:

awk '!($0 in a) {c++;a[$0]} END {exit(c==NR/2?0:1)}' file1 file2

and check the return status of the command (0 if the files are equal, 1 otherwise). The assumption we made that the two files must not contain duplicate lines is crucial for the program to work correctly. In essence, what it does is to keep track of the number of different lines seen. If this number is exactly equal to half the number of total input records seen, then the two files must be equal (in the sense described above). To understand that, just realize that, in all other cases (ie, when a file is only a partial subset or is not a subset of the other), the total number of distinct lines seen will always be greater than NR/2.

The program's complexity is linear in the number of input records.

Pitfall: contexts and variable types in awk

We have this file:

1,2,3,,5,foo
1,2,3,0,5,bar
1,2,3,4,5,baz

and we want to replace the last field with "X" only when the fourth field is not empty. We thus do this:

awk -F ',' -v OFS=',' '{if ($4) $6="X"}1'

But we see that the substitution only happens in the last line, instead of the last two as we expected. Why?

Basically, there are only two data types in awk: strings and numbers. Internally, awk does not assign a fixed type to the variables; they are literally considered to be of type "number" and "string" at the same time, with the number 0 and the null string being equivalent. Only when a variable is used in the program, awk automatically converts it to the type it deems appropriate for the context. Some contexts strictly require a specific type; in that case, awk automatically converts the variable to that type and uses it. In contexts that does not require a specific type, awk treats variables that "look like" numbers as numbers, and the other variables are treated as strings. In out example above, the simple test "if ($4)" does not provide a specific context, since the tested variable can be anything. In the first line, $4 is an empty string, so awk considers it false for the purposes of the test. In the second line, $4 is "0". Since it look like a number, awk uses it like a number, ie zero. Since 0 is considered false, the test is unsuccessful and the substitution is not performed.

Luckily, there is a way to help awk and tell it exactly what we want. We can use string concatenation and append an empty string to the variable (which does not change its value) to explicitly tell awk that we want it to treat it like a string, or, conversely, add 0 to the variable (again, without changing its value) to explicitly tell awk that we want a number. So this is how our program should be to work correctly:

awk -F ',' -v OFS=',' '{if ($4"") $6="X"}1'   # the "" forces awk to evaluate the variable as a string

With this change, in the second line the if sees the string "0", which is not considered false, and the test succeeds, just as we wanted.

As said above, the reverse is also true. Another typical problematic program is this:

awk '/foo/{tot++} END{print tot}'

This, in the author's intention, should count the number of lines that match /foo/. But if /foo/ does not appear in the input, the variable tot retains its default initial value (awk initializes all variables with the dual value "" and 0). print expects a string argument, so awk supplies the value "". The result is that the program prints just an empty line. But we can force awk to treat the variable as numeric, by doing this:

awk '/foo/{tot++} END{print tot+0}'

The seemingly innocuous +0 has the effect of providing numeric context to the variable "tot", so awk knows it has to prefer the value 0 of the variable over the other possible internal value (the empty string). Then, numeric-to-string conversion still happens to satisfy print, but this time what awk converts to string is 0, so print sees the string "0" as argument, and prints it.

Note that, if an explicit context has been provided to a variable, awk remembers that. That can lead to unexpected results:

# input: 2.5943 10
awk '{$1=sprintf("%d",$1);   # truncates decimals, but also explicitly turns $1 into a string!
      if($1 > $2) print "something went wrong!" }          # this is printed

Here, after the sprintf(), awk notes that we want $1 to be a string (in this case, "2"). Then, when we do if($1>$2), awk sees that $2 has no preferred type, while $1 does, so it converts $2 into a string (to match the wanted type of $1) and does a string comparison. Of course, 99.9999% of the times this is not what we want here. In this case, the problem is easily solved by doing "if ($1+0 > $2)" (doing $2+0 instead WON'T work!), doing "$1=$1+0" after the sprintf(), or using some other means to truncate the value of $1, that does not give it explicit string type.

Pulling out things

Suppose you have a file like this:

Yesterday I was walking in =the street=, when I saw =a
black dog=. There was also =a cat= hidden around there. =The sun= was shining, and =the sky= was blue.
I entered =the
music
shop= and I bought two CDs. Then I went to =the cinema= and watched =a very nice movie=.
End of the story.

Ok, silly example, fair enough. But suppose that we want to print only and all the parts of that file that are like =something=. We have no knowledge of the structure of the file. The parts we're interested in might be anywere; they may span lines, or there can be many of them on a single line. This seemingly daunting and difficult task is actually easily accomplished with this small awk program:

awk -v RS='=' '!(NR%2)'
# awk -v RS='=' '!(NR%2){gsub(/\n/," ");print}'    # if you want to reformat embedded newlines

Easy, wasn't it? Let's see how this works. Setting RS to '=' tells awk that records are separated by '=' (instead of the default newline character). If we look at the file as a series of records separated by '=', it becomes clear that what we want are the even-numbered records. So, just throw in a condition that is true for even-numbered records to trigger the printing.

GNU awk can take this technique a step further, since it allows us to assign full regexes to RS, and introduces a companion variable (RT) that stores the part of the input that actually matched the regex in RS. This allows us, for example, to apply the previous technique when the interesting parts of the input are delimited by different characters or string, like for example when we want everything that matches <tag>something</tag>. With GNU awk, we can do this:

gawk -v RS='</?tag>' 'RT=="</tag>"'

or again

gawk -v RS='</?tag>' '!(NR%2)'

and be done with that. Another nice thing that can be done with GNU awk and RT is printing all the parts of a file that match an arbitrary regular expression (something otherwise usually not easily accomplished). Suppose that we want to print everything that looks like a number in a file (simplifiying, here any sequence of digits is considered a number, but of course this can be refined), we can do just this:

gawk -v RS='[0-9]+' 'RT{print RT}'

Checking that RT is not null is necessary because for the last record of a file RT is null, and an empty line would be printed in that case. The output produced by the previous program is similar to what can be obtained using grep -o. But awk can do better than that. We can use a slight variation of this same technique if we want to add context to our search (something grep -o alone cannot do). For example, let's say that we want to print all numbers, but only if they appear inside "--", eg like --1234--, and not otherwise. With gawk, we can do this:

gawk -v RS='--[0-9]+--' 'RT{gsub(/--/,"",RT);print RT}'

So, a carefully crafted RS selects only the "right" data, that can be subsequently extracted safely and printed.

With non-GNU awk, matching all occurrences of an expression can still be done, it just requires more code. See FindAllMatches.

Have fun!

Have fun learning Awk! It's a fun language to know.

Ps. I will go silent for a week. I have an on-site interview with Google in Mountain View, California. I'll be back on 31st of October and will post something new in the first week of November!

This article is part of the article series "MIT Introduction to Algorithms."
<- previous article next article ->

MIT AlgorithmsThis is the eighth post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lecture twelve, which introduces an efficient search structure called Skip Lists.

Skip lists are an efficient data structure that can be used in place of balanced trees. Skip lists use probabilistic balancing rather than strictly enforce balancing and as a result the algorithms for insertion and deletion in skip lists are much simpler to implement and significantly faster than the equivalent algorithms for balanced search trees (see lectures 9 and 10 for search trees).

Here is the first publication ever on skip lists by their inventor William Pugh - Skip Lists: A Probabalistic Alternative to Balanced Trees

The lecture begins with Erik Demaine (professor) telling the students that before the lecture he implemented skip lists to see how fast it could be done. It took him 10 minutes to implement linked lists (on which skip lists are based) and took another 30 to implement skip lists themselves (+30 mins debugging ;) ). He says that they are so simple that at no point did he have to think while implementing them. Compare it to red-black trees for example, where you have to think really hard to implement the operations to maintain red-black tree properties (see lecture 10 on red-black trees).

The lecture continues with derivation of skip lists from scratch. First, a single sorted linked list is considered. Searching in a sorted linked list takes O(n). Then another sorted linked list is added. The added linked list is chosen to be a sublist of the first linked list. The elements that both lists share are linked together with another link (see my scanned lecture notes below to see how it looks). The search cost for such structure is O(2·sqrt(n)). Then another and another linked list is added, the search cost goes down to O(3·n1/3), O(4·n1/4), ..., O(k·n1/k). Now, what should k be? Take it to be log(n), we get running time of O(lg(n)·n1/lg(n)), which is O(2·lg(n)) - logarithmic time search structure!

The lecture also presents Search(x) method for finding the element x in a skip list, and Insert(x) method for inserting the element x in a skip list.

The rest of the lecture is allocated to the proof that the search operation in skiplists takes O(lg(n)) time with high probability.

You're welcome to watch lecture twelve:

Topics covered in lecture twelve:

  • [00:10] A new dynamic, balanced, randomized and simple search structure - skip lists.
  • [01:20] Review of other simple search structures - treaps, red-black trees, b-trees.
  • [03:55] Skip lists are a data structure that you will never forget, so simple it is.
  • [06:10] All the skip list operations take O(lg(n)) time almost guaranteed (with high probability).
  • [07:30] Implementing skip lists from scratch.
  • [09:05] Search cost in a single sorted linked list.
  • [10:30] Adding a second linked list, search cost analysis.
  • [11:16] Trick question: what is this sequence 14, 23, 34, 42, 50, 59, 66, 72, 79, 86, 96, 103, 110, 116, 125? (I won't reveal the answer ;) , you'll have to watch the lecture).
  • [14:35] Building skip list out of this sequence.
  • [17:07] Search(x) operation algorithm for a skip list.
  • [18:40] Example of search operation.
  • [21:20] What elements do go in the second linked list?
  • [25:30] Search cost of a skip list with two sorted linked lists.
  • [27:53] Skip list with three sorted link lists.
  • [29:14] Skip list with k sorted linked lists.
  • [29:25] What should k be? (It should ideally be lg(n)).
  • [32:10] Example of an ideal skip list.
  • [38:30] Skip lists roughly maintain their structure subject to updates (insert, delete).
  • [39:40] Insert(x) operation algorithm for a skip list.
  • [47:00] Building a random skip list.
  • [54:00] Rigorous analysis of search cost of skip lists.
  • [55:00] Theorem: with high probability, search in n element skiplist costs O(lg(n)).
  • [57:10] Definition: with high probability.
  • [01:00:00] Boole's inequality.
  • [01:06:55] Joke: probability that Search(x) algorithm takes more than O(lg(n)) time is smaller than a meteor striking the earth at the same time that your computer has floating point error at the same time earth explodes :D .
  • [01:07:45] Lemma: With high probability number of levels in a skiplist is O(lg(n)).
  • [01:13:00] Cool idea - analyze search time backwards.

Lecture twelve notes:

MIT Algorithms Lecture 12 Notes Thumbnail. Page 1 of 2.
Lecture 12, page 1 of 2.

MIT Algorithms Lecture 12 Notes Thumbnail. Page 2 of 2.
Lecture 12, page 2 of 2.

Have fun building skip lists! The next post will be about two really theoretical lectures on amortized analysis, competitive analysis and self-organizing lists.

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms"):

PPS. This was one of the lectures that did not have a corresponding chapter in the book.

set operationsA while ago I wrote about how I solved the Google Treasure Hunt Puzzle Nr. 4 about prime numbers. I took an unusual approach and solved this problem entirely from the Unix shell. The solution involved finding the intersection between a bunch of files containing numbers. This lead me to an idea to write a post about how to do various set operations from the shell by using common utilities such as sort, uniq, diff, grep, head, tail, comm, and others.

I'll cover the following set operations in this article:

  • Set Membership. Test if an element belongs to a set.
  • Set Equality. Test if two sets contain the same elements.
  • Set Cardinality. Return the number of elements in the set.
  • Subset Test. Test if a given set is a subset of another set.
  • Set Union. Find union of two sets.
  • Set Intersection. Find intersection of two sets.
  • Set Complement. Given two sets A and B, find all elements in A that are not in B.
  • Set Symmetric Difference. Find symmetric difference of two sets.
  • Power Set. Generate all subsets of a set.
  • Set Cartesian Product. Find A x B.
  • Disjoint Set Test. Test if two sets are disjoint.
  • Empty Set Test. Test if a given set is empty.
  • Minimum. Find the smallest element of a set.
  • Maximum. Find the largest element of a set.

Update: I wrote another post about these operations and created a cheat sheet.
Download cheat sheet: set operations in unix shell (.txt)

To illustrate these operations, I created a few random sets to work with. Each set is represented as a file with one element per line. The elements are positive numbers.

First I created two sets A and B with 5 elements each so that I could easily check that the operations really work.

Sets A and B are hand crafted. It's easy to see that only elements 1, 2 and 3 are in common:

$ cat A     $ cat B
3           11
5           1
1           12
2           3
4           2

I also created a set Asub which is a subset of set A and Anotsub which is not a subset of A (to test the Subset Test operation):

$ cat Asub     $ cat Anotsub
3              6
2              7
5              8

Next I created two equal sets Aequal and Bequal again with 5 elements each:

$ cat Aequal     $ cat Bequal
103              100
102              101
101              102
104              103
100              104

Then I created two huge sets Abig and Bbig with 100,000 elements (some of them are repeated, but that's ok).

The easiest way to generate sets Abig and Bbig is to take natural numbers from /dev/urandom. There are two shell commands that can easily do that. The first is "od" and the second is "hexdump".

Here is how to create two files with 100,000 natural numbers with both commands.

With hexdump:

$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Abig
$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Bbig

The "-e" switch specifies a hand-crafted output format. It says take 1 element of size 4 bytes and output it as an unsigned integer. The "-n" switch specifies how many bytes to read, in this case 400000 (400000 bytes / 4 bytes per element = 100000 elements).

With od:

$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Abig
$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Bbig

The "-An" switch specifies that no line address is necessary. The "-w4" switch specifies number of bytes to output per line. The "-tu4" says to output unsigned 4-byte numbers and "-N400000" limits the output to 400000 bytes (400000/4 = 100000 elements). The output from od has to be filtered through sed to drop the leading whitespace characters.

Okay, now let's look at various set operations.

Set Membership

The set membership operation tests if an element belongs to a set. We write aA, if element a belongs to set A, and we write aA, if it does not.

The easiest way to test if an element is in a set is to use "grep" command. Grep searches the file for lines matching a pattern:

$ grep -xc 'element' set

The "-c" flag outputs number of elements in the set. If it is not a multi-set, the number of elements should be 0 or 1. The "-x" option specifies to match the whole line only (no partial matches).

Here is an example of this operation run on set A:

$ grep -xc '4' A
1
$ grep -xc '999' A
0

That's correct. Set A contains element 4 but does not contain element 999.

If the membership operation has to be used from a shell script, the return code from grep can be used instead. Unix commands succeed if the return code is 0, and fail otherwise:

$ grep -xq 'element' set
# returns 0 if element ∈ set
# returns 1 if element ∉ set

The "-q" flag makes sure that grep does not output the element if it is in the set.

Set Equality

The set equality operation tests if two sets are the same, i.e., contain the same elements. We write A = B if sets A and B are equal and AB if they are not.

The easiest way to test if two sets are equal is to use "diff" command. Diff command compares two files for differences. It will find that the order of lines differ, so the files have to be sorted first. If they are multi-sets, the output of sort has to be run through "uniq" command to eliminate duplicate elements:

$ diff -q <(sort set1 | uniq) <(sort set2 | uniq)
# returns 0 if set1 = set2
# returns 1 if set1 ≠ set2

The "-q" flag quiets the output of diff command.

Let's test this operation on sets A, B, Aequal and Bequal:

$ diff -q <(sort A | uniq) <(sort B | uniq)
# return code 1 -- sets A and B are not equal

$ diff -q <(sort Aequal | uniq) <(sort Bequal | uniq)
# return code 0 -- sets A and B are equal

If you have already sorted sets, then just run:

$ diff -q set1 set2

Set Cardinality

The set cardinality operations returns the number of elements in the set. We write |A| to denote the cardinality of the set A.

The simplest way to count the number of elements in a set is to use "wc" command. Wc command counts the number of characters, words or lines in a file. Since each element in the set appears on a new line, counting the number of lines in the file will return the cardinality of the set:

$ wc -l set | cut -d' ' -f1

Cut command is necessary because "wc -l" also outputs the name of the file it was ran on. The cut command outputs the first field which is number of lines in the file.

We can actually get rid of cut:

$ wc -l < set

Let's test if on sets A and Abig:

$ wc -l A | cut -d' ' -f1
5

$ wc -l Abig | cut -d' ' -f1
100000

$ wc -l < A
5

$ wc -l < Abig 
100000

Subset Test

The subset test tests if the given set is a subset of another set. We write SA if S is a subset of A, and SA, if it's not.

I found a very easy way to do it using the "comm" utility. Comm compares two sorted files line by line. It may be run in such a way that it outputs lines that appear only in the first specified file. If the first file is subset of the second, then all the lines in the 1st file also appear in the 2nd, so no output is produced:

$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
# comm returns no output if subset ⊆ set
# comm outputs something if subset ⊊ set

Please remember that if you have a numeric set, then sort must take "-n" option.

Let's test if Asub is a subset of A:

$ comm -23 <(sort -n Asub|uniq) <(sort -n A|uniq) | head -1
# no output - yes, Asub ⊆ A

Now let's test if Anotsub is a subset of A:

$ comm -23 <(sort -n Anotsub|uniq) <(sort -n A|uniq) | head -1
6 # has output - no, Anotsub ⊊ A

If you want to use it from a shell script, you'd have to test if the output from this command was empty or not.

Set Union

The set union operation unions two sets, i.e., join them into one set. We write C = AB to denote union of sets A and B which produces set C.

Set union is extremely easy to create. Just use the "cat" utility to concatenate two files:

$ cat set1 set2

If the duplicates (elements which are both in set1 and set2) are not welcome, then the output of cat can be filtered via awk:

 
$ cat set1 set2 | awk '!found[$1]++'

# we can also get rid of cat by just using awk:

$ awk '!found[$1]++' set1 set2

If we don't want to use awk, which is a whole-blown programming language, then we can sort the output of cat and filter it via uniq:

$ cat set1 set2 | sort | uniq

# we can get rid of cat by specifying arguments to sort:

$ sort set1 set2 | uniq

# finally we can get rid of uniq by specifying -u flag to sort

$ sort -u set1 set2

If the sets set1 and set2 are already sorted, then the union operation can be made much faster by specifying the "-m" command line option, which merges the files (like the final step of merge-sort algorithm):

$ sort -m set1 set2 | uniq

# or

$ set -um set1 set2

Let's test this operation on sets A and B:

$ cat A B # with duplicates
3
5
1
2
4
11
1
12
3
2

$ awk '!found[$1]++' # without dupes
3
5
1
2
4
11
12

$ sort -n A B | uniq # with sort && uniq
1
2
3
4
5
11
12

Set Intersection

The set intersection operation finds elements that are in both sets at the same time. We write C = AB to denote the intersection of sets A and B, which produces the set C.

There are many ways to do set intersection. The first way that I am going to show you uses "comm":

$ comm -12 <(sort set1) <(sort set2)

The "-12" option to comm directs it to suppress output of lines appearing just in the 1st and the 2nd file and makes it output lines appearing in both 1st and 2nd, which is the intersection of two sets.

Please remember that if you have a numeric set, then sort must take "-n" option.

Another way to do it is to use "grep" utility. I actually found about this method as I was writing this article:

$ grep -xF -f set1 set2

The "-x" option forces grep to match the whole lines (no partial matches). The "-f set1" specifies the patterns to use for searching. The "-F" option makes grep interpret the given patterns literally (no regexes). It works by matching all lines of set1 in set2. The lines that appear just in set1 or just in set2 are never output.

The next way to find intersection is by using "sort" and "uniq":

$ sort set1 set2 | uniq -d

The "-d" option to uniq forces it to print only the duplicate lines. Obviously, if a line appears in set1 and set2, after sorting there will be two consecutive equal lines in the output. The "uniq -d" command prints such repeated lines (but only 1 copy of it), thus it's the intersection operation.

Just a few minutes before publishing this article I found another way to do intersection with "join" command. Join command joins files on a common field:

$ join <(sort -n A) <(sort -n B)

Here is a test run:

$ sort -n A B | uniq -d
1
2
3

$ grep -xF -f A B
1
3
2

$ comm -12 <(sort -n A) <(sort -n B)
1
2
3

Set Complement

The set complement operation finds elements that are in one set but not the other. We write A - B or A \ B to denote set's B complement in set A.

Comm has become a pretty useful command for operating on sets. It can be applied to implement set complement operation as well:

$ comm -23 <(sort set1) <(sort set2)

The option "-23" specifies that comm should not print elements that appear just in set2 and that are common to both. It leaves comm to print elements which are just in set1 (and not in set2).

The "grep" command can also be used to implement this operation:

$ grep -vxF -f set2 set1

Notice that the order of sets has been reversed from that of comm. That's because we are searching those elements in set1, which are not in set2.

Another way to do it is, of course, with "sort" and "uniq":

$ sort set2 set2 set1 | uniq -u

This is a pretty tricky command. Suppose that a line appears in set1 but does not appear in set2. Then it will be output just once and will not get removed by uniq. All other lines get removed.

Let's put these commands to test:

$ comm -23 <(sort -n A) <(sort -n B)
4
5

$ grep -vxF -f B A
5
4

$ sort -n B B A | uniq -u
4
5

Set Symmetric Difference

The set symmetric difference operation finds elements that are in one set, or in the other but not both. We write A Δ B to denote symmetric difference of sets A and B.

The operation can be implemented very easily with "comm" utility:

$ comm -3 <(sort set1) <(sort set2) | sed 's/\t//g'

# sed can be replaced with tr

$ comm -3 <(sort set1) <(sort set2) | tr -d '\t'

Here comm is instructed via "-3" not to output fields that are common to both files, but to output fields that are just in set1 and just in set2. Sed is necessary because comm outputs two columns of data and some of it is right padded with a \t tab character.

It can also be done with "sort" and "uniq":

$ sort set1 set2 | uniq -u

We can use mathematics and derive a few formulas involving previously used operations for symmetric difference: A Δ B = (A - B) ∪ (B - A). Now we can use grep:

$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)
# does (B - A) ∪ (A - B)

# this can be simplified

$ grep -vxF -f set1 set2; grep -vxF -f set2 set1

Let's test it:

$ comm -3 <(sort -n A) >(sort -n B) | sed 's/\t//g'
11
12
4
5

$ sort -n A B | uniq -u
4
5
11
12

$ cat <(grep -vxF -f B A) <(grep -vxF -f A B)
5
4
11
12

Power Set

The power set operation generates a power-set of a set. What's a power set? It's a set that contains all subsets of the set. We write P(A) or 2A to denote all subsets of A. For a set with n elements, the power set contains 2n elements.

For example, the power-set of the set { a, b, c } contains 23 = 8 elements. The power-set is { {}, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} }.

It's not easy to do that with simple Unix tools. I could not think of anything better than a silly Perl solution:

$ perl -le '
sub powset {
 return [[]] unless @_;
 my $head = shift;
 my $list = &powset;
 [@$list, map { [$head, @$_] } @$list]
}
chomp(my @e = <>);
for $p (@{powset(@e)}) {
 print @$p;
}' set

Can you think of a way to do it with Unix tools?

Set Cartesian Product

The set Cartesian product operation produces produces a new set that contains all possible pairs of elements from one set and the other. The notation for Cartesian product of sets A and B is A x B.

For example, if set A = { a, b, c } and set B = { 1, 2 } then the Cartesian product A x B = { (a, 1), (a, 2), (b, 1), (b, 2), (c, 1), (c, 2) }.

I can't think of a great solution. I have a very silly solution in bash:

$ while read a; do while read b; do echo "$a, $b"; done < set1; done < set2

Can you think of other solutions?

Disjoint Set Test

The disjoint set test operation finds if two sets are disjoint, i.e., they do not contain common elements.

Two sets are disjoint if their intersection is the empty set. Any of the set intersection commands (mentioned earlier) can be applied on the sets and the output can be tested for emptiness. If it is empty, then the sets are disjoint, if it is not, then the sets are not disjoint.

Another way to test if two sets are disjoint is to use awk:

$ awk '{ if (++seen[$0]==2) exit 1 }' set1 set2
# returns 0 if sets are disjoint
# returns 1 if sets are not disjoint

It works by counting seen elements in set1 and then set2. If any of the elements appear both in set1 and set2, seen count for that element would be 2 and awk would quit with exit code 1.

Empty Set Test

The empty set test tests if the set is empty, i.e., contains no elements. The empty set is usually written as Ø.

It's very easy to test if the set is empty. The cardinality of an empty set is 0:

$ wc -l set | cut -d' ' -f1
# outputs 0 if the set is empty
# outputs > 0 if the set is not empty

Getting rid of cut:

$ wc -l < set
# outputs 0 if the set is empty
# outputs > 0 if the set is not empty

Minimum

The minimum operation returns the smallest number in the set. We write min(A) to denote the minimum operation on the set A.

The minimum element of a set can be found by first sorting it in ascending order and then taking the first element. The first element can be taken with "head" Unix command which outputs the first part of the file:

$ head -1 <(sort set)

The "-1" option specifies to output the first line only.

If the set is already sorted, then it's even simpler:

$ head -1 set

Remember to use "sort -n" command if the set contains numeric data.

Example of running minimum operation on sets A and Abig:

$ head -1 <(sort -n A)
1
$ head -1 <(sort -n Abig)
2798

Maximum

The maximum operation returns the biggest number in the set. We write max(A) to denote the maximum operation on the set A.

The maximum element of a set can be found by first sorting it in ascending order and then taking the last element. The last element can be taken with "tail" Unix command which outputs the last part of the file:

$ tail -1 <(sort set)

The "-1" option specifies to output the last line only.

If the set is already sorted, then it's even simpler:

$ tail -1 set

Remember to use "sort -n" command if the set contains numeric data.

Example of running maximum operation on sets A and Abig:

$ tail -1 <(sort -n A)
5
$ head -1 <(sort -n Abig)
4294906714

Have Fun!

Have fun working with these set operations! Thanks to lhunath and waldner from #bash for helping. :)