set operations

A while ago I wrote about how I solved the Google Treasure Hunt Puzzle Nr. 4 about prime numbers. I took an unusual approach and solved this problem entirely from the Unix shell. The solution involved finding the intersection between a bunch of files containing numbers. That led me to the idea of writing a post about how to do various set operations from the shell by using common utilities such as sort, uniq, diff, grep, head, tail, comm, and others.

I'll cover the following set operations in this article:

  • Set Membership. Test if an element belongs to a set.
  • Set Equality. Test if two sets contain the same elements.
  • Set Cardinality. Return the number of elements in the set.
  • Subset Test. Test if a given set is a subset of another set.
  • Set Union. Find union of two sets.
  • Set Intersection. Find intersection of two sets.
  • Set Complement. Given two sets A and B, find all elements in A that are not in B.
  • Set Symmetric Difference. Find symmetric difference of two sets.
  • Power Set. Generate all subsets of a set.
  • Set Cartesian Product. Find A x B.
  • Disjoint Set Test. Test if two sets are disjoint.
  • Empty Set Test. Test if a given set is empty.
  • Minimum. Find the smallest element of a set.
  • Maximum. Find the largest element of a set.

Update: I wrote another post about these operations and created a cheat sheet.
Download cheat sheet: set operations in unix shell (.txt)

To illustrate these operations, I created a few random sets to work with. Each set is represented as a file with one element per line. The elements are positive numbers.

First I created two sets A and B with 5 elements each so that I could easily check that the operations really work.

Sets A and B are hand crafted. It's easy to see that only elements 1, 2 and 3 are in common:

$ cat A     $ cat B
3           11
5           1
1           12
2           3
4           2

I also created a set Asub which is a subset of set A and Anotsub which is not a subset of A (to test the Subset Test operation):

$ cat Asub     $ cat Anotsub
3              6
2              7
5              8

Next I created two equal sets Aequal and Bequal again with 5 elements each:

$ cat Aequal     $ cat Bequal
103              100
102              101
101              102
104              103
100              104

Then I created two huge sets Abig and Bbig with 100,000 elements (some of them are repeated, but that's ok).

The easiest way to generate sets Abig and Bbig is to take natural numbers from /dev/urandom. There are two shell commands that can easily do that. The first is "od" and the second is "hexdump".

Here is how to create two files with 100,000 natural numbers with both commands.

With hexdump:

$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Abig
$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Bbig

The "-e" switch specifies a hand-crafted output format. It says take 1 element of size 4 bytes and output it as an unsigned integer. The "-n" switch specifies how many bytes to read, in this case 400000 (400000 bytes / 4 bytes per element = 100000 elements).

With od:

$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Abig
$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Bbig

The "-An" switch specifies that no line address is necessary. The "-w4" switch specifies number of bytes to output per line. The "-tu4" says to output unsigned 4-byte numbers and "-N400000" limits the output to 400000 bytes (400000/4 = 100000 elements). The output from od has to be filtered through sed to drop the leading whitespace characters.

Okay, now let's look at various set operations.

Set Membership

The set membership operation tests if an element belongs to a set. We write a ∈ A if element a belongs to set A, and we write a ∉ A if it does not.

The easiest way to test if an element is in a set is to use "grep" command. Grep searches the file for lines matching a pattern:

$ grep -xc 'element' set

The "-c" flag makes grep output the number of matching lines, i.e., the number of times the element appears in the set. If it is not a multi-set, the count should be 0 or 1. The "-x" option specifies to match the whole line only (no partial matches).

Here is an example of this operation run on set A:

$ grep -xc '4' A
1
$ grep -xc '999' A
0

That's correct. Set A contains element 4 but does not contain element 999.

If the membership operation has to be used from a shell script, the return code from grep can be used instead. Unix commands succeed if the return code is 0, and fail otherwise:

$ grep -xq 'element' set
# returns 0 if element ∈ set
# returns 1 if element ∉ set

The "-q" flag makes sure that grep does not output the element if it is in the set.
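
For example, a script could branch on grep's return code like this (a quick sketch using set A from above):

if grep -xq '4' A; then
    echo "4 is in A"
else
    echo "4 is not in A"
fi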

Set Equality

The set equality operation tests if two sets are the same, i.e., contain the same elements. We write A = B if sets A and B are equal and A ≠ B if they are not.

The easiest way to test if two sets are equal is to use the "diff" command. Diff compares two files for differences. It is sensitive to the order of lines, so the files have to be sorted first. If they are multi-sets, the output of sort has to be run through the "uniq" command to eliminate duplicate elements:

$ diff -q <(sort set1 | uniq) <(sort set2 | uniq)
# returns 0 if set1 = set2
# returns 1 if set1 ≠ set2

The "-q" flag quiets the output of diff command.

Let's test this operation on sets A, B, Aequal and Bequal:

$ diff -q <(sort A | uniq) <(sort B | uniq)
# return code 1 -- sets A and B are not equal

$ diff -q <(sort Aequal | uniq) <(sort Bequal | uniq)
# return code 0 -- sets Aequal and Bequal are equal

If you have already sorted sets, then just run:

$ diff -q set1 set2
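
In a script, the return code can again drive a conditional; here is a minimal sketch (assuming sorted, duplicate-free sets):

if diff -q set1 set2 >/dev/null; then
    echo "sets are equal"
else
    echo "sets are not equal"
fi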

Set Cardinality

The set cardinality operation returns the number of elements in the set. We write |A| to denote the cardinality of the set A.

The simplest way to count the number of elements in a set is to use "wc" command. Wc command counts the number of characters, words or lines in a file. Since each element in the set appears on a new line, counting the number of lines in the file will return the cardinality of the set:

$ wc -l set | cut -d' ' -f1

Cut command is necessary because "wc -l" also outputs the name of the file it was run on. The cut command outputs the first field, which is the number of lines in the file.

We can actually get rid of cut:

$ wc -l < set

Let's test it on sets A and Abig:

$ wc -l A | cut -d' ' -f1
5

$ wc -l Abig | cut -d' ' -f1
100000

$ wc -l < A
5

$ wc -l < Abig 
100000

Subset Test

The subset test tests if the given set is a subset of another set. We write S ⊆ A if S is a subset of A, and S ⊄ A if it's not.

I found a very easy way to do it using the "comm" utility. Comm compares two sorted files line by line. It may be run in such a way that it outputs lines that appear only in the first specified file. If the first file is a subset of the second, then all the lines in the 1st file also appear in the 2nd, so no output is produced:

$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
# comm returns no output if subset ⊆ set
# comm outputs something if subset ⊄ set

Please remember that if you have a numeric set, then sort must take "-n" option.

Let's test if Asub is a subset of A:

$ comm -23 <(sort -n Asub|uniq) <(sort -n A|uniq) | head -1
# no output - yes, Asub ⊆ A

Now let's test if Anotsub is a subset of A:

$ comm -23 <(sort -n Anotsub|uniq) <(sort -n A|uniq) | head -1
6 # has output - no, Anotsub ⊄ A

If you want to use it from a shell script, you'd have to test if the output from this command was empty or not.
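
Here is one way such a test might look (a sketch using the sets from above):

if [ -z "$(comm -23 <(sort -n Asub|uniq) <(sort -n A|uniq) | head -1)" ]; then
    echo "Asub is a subset of A"
else
    echo "Asub is not a subset of A"
fi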

Set Union

The set union operation joins two sets into one. We write C = A ∪ B to denote the union of sets A and B, which produces set C.

Set union is extremely easy to create. Just use the "cat" utility to concatenate two files:

$ cat set1 set2

If the duplicates (elements which are both in set1 and set2) are not welcome, then the output of cat can be filtered via awk:

$ cat set1 set2 | awk '!found[$1]++'

# we can also get rid of cat by just using awk:

$ awk '!found[$1]++' set1 set2

If we don't want to use awk, which is a full-blown programming language, then we can sort the output of cat and filter it via uniq:

$ cat set1 set2 | sort | uniq

# we can get rid of cat by specifying arguments to sort:

$ sort set1 set2 | uniq

# finally we can get rid of uniq by specifying -u flag to sort

$ sort -u set1 set2

If the sets set1 and set2 are already sorted, then the union operation can be made much faster by specifying the "-m" command line option, which merges the files (like the final step of merge-sort algorithm):

$ sort -m set1 set2 | uniq

# or

$ sort -um set1 set2

Let's test this operation on sets A and B:

$ cat A B # with duplicates
3
5
1
2
4
11
1
12
3
2

$ awk '!found[$1]++' A B # without dupes
3
5
1
2
4
11
12

$ sort -n A B | uniq # with sort && uniq
1
2
3
4
5
11
12

Set Intersection

The set intersection operation finds elements that are in both sets at the same time. We write C = A ∩ B to denote the intersection of sets A and B, which produces the set C.

There are many ways to do set intersection. The first way that I am going to show you uses "comm":

$ comm -12 <(sort set1) <(sort set2)

The "-12" option tells comm to suppress column 1 (lines that appear only in the 1st file) and column 2 (lines that appear only in the 2nd file), and to output only the lines that appear in both files, which is the intersection of the two sets.

Please remember that if you have a numeric set, then sort must take "-n" option.

Another way to do it is to use the "grep" utility. I actually found out about this method as I was writing this article:

$ grep -xF -f set1 set2

The "-x" option forces grep to match the whole lines (no partial matches). The "-f set1" specifies the patterns to use for searching. The "-F" option makes grep interpret the given patterns literally (no regexes). It works by matching all lines of set1 in set2. The lines that appear just in set1 or just in set2 are never output.

The next way to find intersection is by using "sort" and "uniq":

$ sort set1 set2 | uniq -d

The "-d" option to uniq forces it to print only the duplicate lines. Obviously, if a line appears in set1 and set2, after sorting there will be two consecutive equal lines in the output. The "uniq -d" command prints such repeated lines (but only 1 copy of it), thus it's the intersection operation.
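
One caveat: this assumes that neither file contains duplicates of its own, otherwise an element repeated inside a single set would falsely show up as an intersection element. If your sets may be multi-sets, a safer variant (a sketch) is to deduplicate each one first:

$ sort <(sort -u set1) <(sort -u set2) | uniq -d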

Just a few minutes before publishing this article I found another way to do intersection with "join" command. Join command joins files on a common field:

$ join <(sort -n A) <(sort -n B)

Here is a test run:

$ sort -n A B | uniq -d
1
2
3

$ grep -xF -f A B
1
3
2

$ comm -12 <(sort -n A) <(sort -n B)
1
2
3

Set Complement

The set complement operation finds elements that are in one set but not the other. We write A - B or A \ B to denote the complement of B in set A.

Comm has become a pretty useful command for operating on sets. It can be applied to implement set complement operation as well:

$ comm -23 <(sort set1) <(sort set2)

The option "-23" tells comm to suppress the lines that appear only in set2 and the lines common to both. What remains are the lines that are just in set1 (and not in set2).

The "grep" command can also be used to implement this operation:

$ grep -vxF -f set2 set1

Notice that the order of the sets has been reversed from that of comm. That's because we are searching for those elements in set1 which are not in set2.

Another way to do it is, of course, with "sort" and "uniq":

$ sort set2 set2 set1 | uniq -u

This is a pretty tricky command. Set2 is given twice on purpose: any line that appears in set2 appears at least twice after sorting and gets removed by "uniq -u", which prints only non-repeated lines. A line that appears in set1 but not in set2 appears just once and survives.

Let's put these commands to test:

$ comm -23 <(sort -n A) <(sort -n B)
4
5

$ grep -vxF -f B A
5
4

$ sort -n B B A | uniq -u
4
5

Set Symmetric Difference

The set symmetric difference operation finds elements that are in one set, or in the other but not both. We write A Δ B to denote symmetric difference of sets A and B.

The operation can be implemented very easily with "comm" utility:

$ comm -3 <(sort set1) <(sort set2) | sed 's/\t//g'

# sed can be replaced with tr

$ comm -3 <(sort set1) <(sort set2) | tr -d '\t'

Here comm is instructed via "-3" not to output the lines that are common to both files, only the lines that are just in set1 and just in set2. Sed is necessary because comm outputs two columns of data, and the lines that come from the second file are prefixed with a \t tab character.

It can also be done with "sort" and "uniq":

$ sort set1 set2 | uniq -u

We can use mathematics and derive a few formulas involving previously used operations for symmetric difference: A Δ B = (A - B) ∪ (B - A). Now we can use grep:

$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)
# does (B - A) ∪ (A - B)

# this can be simplified

$ grep -vxF -f set1 set2; grep -vxF -f set2 set1

Let's test it:

$ comm -3 <(sort -n A) <(sort -n B) | sed 's/\t//g'
11
12
4
5

$ sort -n A B | uniq -u
4
5
11
12

$ cat <(grep -vxF -f B A) <(grep -vxF -f A B)
5
4
11
12

Power Set

The power set operation generates the power set of a set. What's a power set? It's a set that contains all subsets of the set. We write P(A) or 2^A to denote the power set of A. For a set with n elements, the power set contains 2^n elements.

For example, the power set of the set { a, b, c } contains 2^3 = 8 elements. The power set is { {}, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} }.

It's not easy to do that with simple Unix tools. I could not think of anything better than a silly Perl solution:

$ perl -le '
sub powset {
 return [[]] unless @_;
 my $head = shift;
 my $list = &powset;
 [@$list, map { [$head, @$_] } @$list]
}
chomp(my @e = <>);
for $p (@{powset(@e)}) {
 print @$p;
}' set

Can you think of a way to do it with Unix tools?
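
Here is one possibility for small sets, using nothing but bash: loop over all 2^n bit masks and let each mask pick a subset. This is only a sketch -- it assumes bash 4 or later (for mapfile) and a set small enough that 2^n iterations are feasible:

#!/bin/bash
# print the power set of the set stored in file "set"
mapfile -t e < set            # read elements into an array
n=${#e[@]}
for ((m = 0; m < (1 << n); m++)); do
    subset=""
    for ((i = 0; i < n; i++)); do
        # include element i if bit i of the mask is set
        (( m & (1 << i) )) && subset+="${e[i]} "
    done
    echo "{ $subset}"
done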

Set Cartesian Product

The set Cartesian product operation produces a new set that contains all possible pairs of elements from one set and the other. The notation for the Cartesian product of sets A and B is A x B.

For example, if set A = { a, b, c } and set B = { 1, 2 } then the Cartesian product A x B = { (a, 1), (a, 2), (b, 1), (b, 2), (c, 1), (c, 2) }.

I can't think of a great solution. I have a very silly solution in bash:

$ while read a; do while read b; do echo "$a, $b"; done < set2; done < set1
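
An equivalent nested loop can also be written in awk, which reads set1 into memory on the first pass and then pairs every stored element with each line of set2 (just a sketch, not necessarily any better):

$ awk 'NR == FNR { a[++n] = $0; next } { for (i = 1; i <= n; i++) print a[i] ", " $0 }' set1 set2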

Can you think of other solutions?

Disjoint Set Test

The disjoint set test operation finds if two sets are disjoint, i.e., they do not contain common elements.

Two sets are disjoint if their intersection is the empty set. Any of the set intersection commands (mentioned earlier) can be applied on the sets and the output can be tested for emptiness. If it is empty, then the sets are disjoint, if it is not, then the sets are not disjoint.
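
For example, reusing the comm-based intersection from above (a sketch):

$ [ -z "$(comm -12 <(sort set1) <(sort set2) | head -1)" ] && echo "disjoint" || echo "not disjoint"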

Another way to test if two sets are disjoint is to use awk:

$ awk '{ if (++seen[$0]==2) exit 1 }' set1 set2
# returns 0 if sets are disjoint
# returns 1 if sets are not disjoint

It works by counting seen elements in set1 and then set2. If any of the elements appears both in set1 and set2, the seen count for that element reaches 2 and awk quits with exit code 1. Note that this assumes neither set contains duplicate elements of its own.

Empty Set Test

The empty set test tests if the set is empty, i.e., contains no elements. The empty set is usually written as Ø.

It's very easy to test if the set is empty. The cardinality of an empty set is 0:

$ wc -l set | cut -d' ' -f1
# outputs 0 if the set is empty
# outputs > 0 if the set is not empty

Getting rid of cut:

$ wc -l < set
# outputs 0 if the set is empty
# outputs > 0 if the set is not empty

Minimum

The minimum operation returns the smallest number in the set. We write min(A) to denote the minimum operation on the set A.

The minimum element of a set can be found by first sorting it in ascending order and then taking the first element. The first element can be taken with "head" Unix command which outputs the first part of the file:

$ head -1 <(sort set)

The "-1" option specifies to output the first line only.

If the set is already sorted, then it's even simpler:

$ head -1 set

Remember to use "sort -n" command if the set contains numeric data.

Example of running minimum operation on sets A and Abig:

$ head -1 <(sort -n A)
1
$ head -1 <(sort -n Abig)
2798

Maximum

The maximum operation returns the biggest number in the set. We write max(A) to denote the maximum operation on the set A.

The maximum element of a set can be found by first sorting it in ascending order and then taking the last element. The last element can be taken with "tail" Unix command which outputs the last part of the file:

$ tail -1 <(sort set)

The "-1" option specifies to output the last line only.

If the set is already sorted, then it's even simpler:

$ tail -1 set

Remember to use "sort -n" command if the set contains numeric data.

Example of running maximum operation on sets A and Abig:

$ tail -1 <(sort -n A)
5
$ tail -1 <(sort -n Abig)
4294906714

Have Fun!

Have fun working with these set operations! Thanks to lhunath and waldner from #bash for helping. :)

This article is part of the article series "Sed One-Liners Explained."

sed -- the superman of unix stream editing

Inspired by the success of my "Awk One-Liners Explained" article (30,000 views in the first three days), I decided to explain the famous sed one-liners as well.

These one-liners, just like the Awk one-liners, are compiled by Eric Pement.

You may download them here: sed one-liners (link to .txt file).

Most people are only familiar with one particular command of sed, namely the "s" (substitute) comand. s/comand/command/. That is unsatisfactory. Sed has at least 20 different commands for you. You can even write tetris in it (not to mention that it's Turing complete).

My sed learning process was actually identical to my Awk learning process. First I went through Bruce Barnett's sed tutorial; then I created a sed cheat sheet; and finally went through sed one-liners. I could not figure out one of the one-liners in the file, so I ended up asking for help in comp.unix.shell.

Eric's sed one-liners are divided into several sections:

  • File spacing.
  • Numbering.
  • Text conversion and substitution.
  • Selective printing of certain lines.
  • Selective deletion of certain lines.
  • Special applications.

Update: Spanish translation of part one is available!

I'll divide this article in 3 parts. In the first part I will cover "File spacing", "Numbering" and "Text conversion and substitution". In the second part I will cover "Selective printing of certain lines" and in the third "Selective deletion of certain lines" and "Special applications".

Before I start explaining, I want to share the key idea that changed the way I think about sed: the four spaces of sed -- input stream, output stream, pattern space, and hold buffer. Sed operates on the input stream and produces an output stream. The lines from the input stream are placed into the pattern space, where they are modified. The hold buffer can be used for temporary storage.

Awesome news: I have written an e-book based on this article series. Check it out:

Grab a copy of sed cheat sheet, print it and let's dive into one-liners!

1. File spacing.

1. Double-space a file.

sed G

This sed one-liner uses the 'G' command. If you grabbed the cheat sheet you'd see that the command 'G' appends to the pattern space. It appends a newline followed by the contents of the hold buffer. In this example the hold buffer is empty all the time (only three commands 'h', 'H' and 'x' modify the hold buffer), so we end up simply appending a newline to the pattern space. Once all the commands have been processed (in this case just the 'G' command), sed puts the contents of the pattern space on the output stream, followed by a newline. There we have it, two newlines -- one added by the 'G' command and the other by the output stream. The file has been double-spaced.
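
A quick test run (the input file is improvised):

$ printf 'one\ntwo\n' | sed G
one

two
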

2. Double-space a file which already has blank lines in it. Do it so that the output contains no more than one blank line between two lines of text.

sed '/^$/d;G'

Sed allows you to restrict commands to certain lines only. This one-liner operates only on lines that match the regular expression /^$/. Which are those? Those are the empty lines. Note that before doing the regular expression match, sed pushes the input line to the pattern space and strips the trailing newline character. An empty line contains just the newline character, so after it has been put into the pattern space, this only character has been removed and the pattern space stays empty. The regular expression /^$/ matches an empty pattern space and sed applies the 'd' command to it, which deletes the current pattern space, reads in the next line, puts it into the pattern space, aborts the current command, and starts execution from the beginning. The lines which do not match emptiness get a newline character appended by the 'G' command, just like in one-liner #1.

In general sed allows you to restrict operations to certain lines (5th, 27th, etc.), to a range of lines (lines 10-20), to lines matching a pattern (lines containing the word "catonmat"), and to lines between two patterns (lines between "catonmat" and "coders").

3. Triple-space a file.

sed 'G;G'

Several sed commands can be combined by separating them by ';'. Such commands get executed one after another. This one-liner does twice what the one-liner #1 does -- appends two newlines (via two 'G' commands) to output.

4. Undo double-spacing.

sed 'n;d'

This one-liner assumes that even-numbered lines are always blank. It uses two new commands - 'n' and 'd'. The 'n' command prints out the current pattern space (unless the '-n' flag has been specified), empties the current pattern space and reads in the next line of input. We assumed that even-numbered lines are always blank. This means that 'n' prints the first, third, fifth, ..., etc. line and reads in the following line. The line following the printed line is always an empty line. Now the 'd' command gets executed. The 'd' command deletes the current pattern space, reads in the next line, puts the new line into the pattern space, aborts the current command, and starts the execution from the first sed command. Now the 'n' command gets executed again, then 'd', then 'n', etc.

To make it shorter - 'n' prints out the current line, and 'd' deletes the empty line, thus undoing the double-spacing.

5. Insert a blank line above every line that matches "regex".

sed '/regex/{x;p;x;}'

This one-liner uses the restriction operation together with two new commands - 'x' and 'p'. The 'x' command exchanges the hold buffer with the pattern buffer. The 'p' command duplicates input -- prints out the entire pattern space. This one-liner works the following way: a line is read into the pattern space, then the 'x' command exchanges it with the empty hold buffer. Next the 'p' command prints out emptiness followed by a newline, so we get an empty line printed before the actual line. Then 'x' exchanges the hold buffer (which now contains the line) with the pattern space again. There are no more commands, so sed prints out the pattern space. We have printed a newline followed by the line, or in other words, inserted a blank line above every line.

Also notice the { ... }. This is command grouping. It says, execute all the commands in "..." on the line that matches the restriction operation.
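
A quick illustration (the input lines are improvised):

$ printf 'foo\nbar\nfoo\n' | sed '/foo/{x;p;x;}'

foo
bar

foo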

6. Insert a blank line below every line that matches "regex".

sed '/regex/G'

This one-liner combines the restriction operation with the 'G' command, described in one-liner #1. For every line that matches /regex/, sed appends a newline to the pattern space. All the other lines that do not match /regex/ just get printed out without modification.

7. Insert a blank line above and below every line that matches "regex".

sed '/regex/{x;p;x;G;}'

This one-liner combines one-liners #5, #6 and #1. Lines matching /regex/ get a newline appended before them and printed (x;p;x from #5). Then they are followed by another newline from the 'G' command (one-liner #6 or #1).

2. Numbering.

8. Number each line of a file (named filename). Left align the number.

sed = filename | sed 'N;s/\n/\t/'

One-liners get trickier and trickier. This one-liner is actually two separate one-liners. The first sed one-liner uses a new command called '='. This command operates directly on the output stream and prints the current line number. There is no way to capture the current line number to pattern space. That's why the second one-liner gets called. The output of first one-liner gets piped to the input of second. The second one-liner uses another new command 'N'. The 'N' command appends a newline and the next line to current pattern space. Then the famous 's///' command gets executed which replaces the newline character just appended with a tab. After these operations the line gets printed out.

To make it clear what '=' does, take a look at this example file:

line one
line two
line three

Running the first one-liner 'sed = filename', produces output:

1
line one
2
line two
3
line three

Now, the 'N' command of the second one-liner joins these lines with a newline character:

1\nline one
2\nline two
3\nline three

The 's/\n/\t/' replaces the newline chars with tabs, so we end up with:

1     line one
2     line two
3     line three

The example is a little inaccurate as line joining with a newline char happens line after line, not on all lines at once.

9. Number each line of a file (named filename). Right align the number.

sed = filename | sed 'N; s/^/     /; s/ *\(.\{6,\}\)\n/\1  /'

This one-liner is also actually two one-liners. The first one-liner numbers the lines, just like #8. The second one-liner uses the 'N' command to join the line containing the line number with the actual line. Then it uses two substitute commands to right-align the number. The first 's' command 's/^/     /' inserts five spaces at the beginning of the line. The second 's' command 's/ *\(.\{6,\}\)\n/\1  /' captures at least six symbols up to a newline and replaces the capture and newline with the back-reference '\1' and two more spaces to separate the line number from the contents of the line.

I think it's hard to understand the last part of this sed expression just by reading it. Let's look at an example. For clarity I replaced the '\n' newline char with a '@' and whitespace with '-'.

$ echo "-----12@contents" | sed 's/-*\(.\{6,\}\)@/\1--/'
----12--contents

The regular expression '-*\(.\{6,\}\)@' (or just '-*(.{6,})@') tells sed to match some '-' characters followed by at least 6 other characters, followed by a '@' symbol. Sed captures them (remembers them) in \1.

In this example sed matches the first '-' (the '-*' part of regex), then the following six characters "----12" and '@' (the '\(.\{6,\}\)@' part of regex). Now it replaces the matched part of the string "-----12@" with the contents of the captured group, which is "----12", plus two extra dashes (which stand for the two spaces in the real one-liner). The final result is that "-----12@" gets replaced with "----12--".

10. Number each non-empty line of a file (called filename).

sed '/./=' filename | sed '/./N; s/\n/ /'

This one-liner is again two one-liners. The output of the first one-liner gets piped to the input of the second. The first one-liner numbers only the lines with at least one character in them. The regular expression '/./' says: match lines with at least one char in them. When the empty lines (containing just a newline) get sent to the pattern space, the newline character gets removed, so the empty lines do not get matched. The second one-liner does the same as one-liner #8, except that only numbered lines get joined and printed out. The command '/./N' makes sure that empty lines are left as-is.

11. Count the number of lines in a file (emulates "wc -l").

sed -n '$='

This one-liner uses a command line switch "-n" to modify sed's behavior. The "-n" switch tells sed not to send the line to output after it has been processed in the pattern space. The only way to make sed output anything with the "-n" switch being on is to use a command that modifies the output stream directly (these commands are '=', 'a', 'c', 'i', 'I', 'p', 'P', 'r' and 'w'). In this one-liner what seems to be the command "$=" is actually a restriction pattern "$" together with the "=" command. The restriction pattern "$" applies the "=" command to the last line only. The "=" command outputs the current line number to standard output. As it is applied to the last line only, this one-liner outputs the number of lines in the file.
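
For example:

$ printf 'line one\nline two\nline three\n' | sed -n '$='
3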

3. Text Conversion and Substitution.

12. Convert DOS/Windows newlines (CRLF) to Unix newlines (LF).

sed 's/.$//'

This one-liner assumes that all lines end with CR+LF (carriage return + line feed) and we are in a Unix environment. Once the line gets read into the pattern space, the newline gets thrown away, so we are left with lines ending in CR. The 's/.$//' command erases the last character by matching the last character of the line (regex '.$') and substituting it with nothing. Now when the pattern space gets output, the newline gets appended and we are left with lines ending with LF.

The assumption about being in a Unix environment is necessary because the newline that gets appended when the pattern space gets copied to output stream is the newline of that environment.

13. Another way to convert DOS/Windows newlines (CRLF) to Unix newlines (LF).

sed 's/^M$//'

This one-liner again assumes that we are in a Unix environment. It erases the carriage return control character ^M. You can usually enter the ^M control char literally by first pressing Ctrl-V (it's control key + v key) and then Ctrl-M.

14. Yet another way to convert DOS/Windows newlines to Unix newlines.

sed 's/\x0D$//'

This one-liner assumes that we are on a Unix machine. It also assumes that we use a version of sed that supports hex escape codes, such as GNU sed. The hex value for CR is 0x0D (13 decimal). This one-liner erases this character.

15-17. Convert Unix newlines (LF) to DOS/Windows newlines (CRLF).

sed "s/$/`echo -e \\\r`/"

This one-liner also assumes that we are in a Unix environment. It calls shell for help. The 'echo -e \\\r' command inserts a literal carriage return character in the sed expression. The sed "s/$/char/" command appends a character to the end of current pattern space.

18. Another way to convert Unix newlines (LF) to DOS/Windows newlines (CRLF).

sed 's/$/\r/'

This one-liner assumes that we use GNU sed. GNU sed is smarter than other seds and can take escape characters in the replace part of s/// command.

19. Convert Unix newlines (LF) to DOS/Windows newlines (CRLF) from DOS/Windows.

sed "s/$//"

This one-liner works from DOS/Windows. It's basically a no-op one-liner. It replaces nothing with nothing and then sends out the line to output stream where it gets CRLF appended.

20. Another way to convert Unix newlines (LF) to DOS/Windows newlines (CRLF) from DOS/Windows.

sed -n p

This is also a no-op one-liner, just like #19. The shortest one-liner which does the same is:

sed ''

21. Convert DOS/Windows newlines (CRLF) to Unix format (LF) from DOS/Windows.

sed "s/\r//"

Eric says that this one-liner works only with UnxUtils sed v4.0.7 or higher. I don't know anything about this version of sed, so let's just trust him. This one-liner strips carriage return (CR) chars from lines. Then when they get output, CRLF gets appended by magic.

Eric mentions that another way to convert DOS/Windows newlines (CRLF) to Unix format (LF) on a DOS machine is to use tr:

tr -d \r <infile >outfile

22. Delete leading whitespace (tabs and spaces) from each line.

sed 's/^[ \t]*//'

Pretty simple, it matches zero-or-more spaces and tabs at the beginning of the line and replaces them with nothing, i.e. erases them.

23. Delete trailing whitespace (tabs and spaces) from each line.

sed 's/[ \t]*$//'

This one-liner is very similar to #22. It does the same substitution, just matching zero-or-more spaces and tabs at the end of the line, and then erases them.

24. Delete both leading and trailing whitespace from each line.

sed 's/^[ \t]*//;s/[ \t]*$//'

This one-liner combines #22 and #23. First it does what #22 does, erases the leading whitespace, and then it does the same as #23, erases the trailing whitespace.

25. Insert five blank spaces at the beginning of each line.

sed 's/^/     /'

It matches the null string at the beginning of the line (^) and replaces it with five spaces "     ".

26. Align lines right on a 79-column width.

sed -e :a -e 's/^.\{1,78\}$/ &/;ta'

This one-liner uses a new command line option and two new commands. The new command line option is '-e'. It allows you to write a sed program in several parts. For example, a sed program with two substitution rules could be written as "sed -e 's/one/two/' -e 's/three/four/'" instead of "sed 's/one/two/;s/three/four/'". It makes it more readable. In this one-liner the first "-e" creates a label called "a". The ':' command followed by a name creates a named label. The second "-e" uses a new command "t". The "t" command branches to a named label if the last substitute command modified the pattern space. This branching technique can be used to create loops in sed. In this one-liner the substitute command left-pads the string (right-aligns it) a single whitespace at a time, until the total length of the string exceeds 78 chars. The "&" in the substitution command means the matched string.

Translating it in modern language, it would look like this:

while (str.length() <= 78) {
 str = " " + str
}

27. Center all text in the middle of 79-column width.

sed  -e :a -e 's/^.\{1,77\}$/ & /;ta'

This one-liner is very similar to #26, but instead of left-padding the line one whitespace character at a time, it pads it on both sides, two spaces at a time, until the line exceeds 77 characters in length. The result is a line 78 or 79 characters wide with the text centered.

Another way to do the same is

sed  -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/'

This one-liner left pads the string one whitespace char at a time until it has reached length of 78 characters. Then the additional "s/\( *\)\1/\1/" command gets executed which divides the leading whitespace "in half". This effectively centers the string. Unlike the previous one-liner this one-liner does not add trailing whitespace. It just adds enough leading whitespace to center the string.

28. Substitute (find and replace) the first occurrence of "foo" with "bar" on each line.

sed 's/foo/bar/'

This is the simplest sed one-liner possible. It uses the substitute command and applies it once on each line. It substitutes string "foo" with "bar".

29. Substitute (find and replace) the fourth occurrence of "foo" with "bar" on each line.

sed 's/foo/bar/4'

This one-liner uses a flag for the substitute command. With no flags the first occurrence of pattern is changed. With a numeric flag like "/1", "/2", etc. only that occurrence is substituted. This one-liner uses numeric flag "/4" which makes it change fourth occurrence on each line.

30. Substitute (find and replace) all occurrence of "foo" with "bar" on each line.

sed 's/foo/bar/g'

This one-liner uses another flag. The "/g" flag which stands for global. With global flag set, substitute command does as many substitutions as possible, i.e., all.

31. Substitute (find and replace) the first occurrence of a repeated occurrence of "foo" with "bar".

sed 's/\(.*\)foo\(.*foo\)/\1bar\2/'

Let's understand this one-liner with an example:

$ echo "this is foo and another foo quux" | sed 's/\(.*\)foo\(.*foo\)/\1bar\2/'
this is bar and another foo quux

As you can see, this one-liner replaced the first "foo" with "bar".

It did it by using two capturing groups. The first capturing group caught everything before the first "foo". In this example it was text "this is ". The second group caught everything after the first "foo", including the second "foo". In this example " and another foo". The matched text was then replaced with contents of first group "this is " followed by "bar" and contents of second group " and another foo". Since " quux" was not part of the match it was left unchanged. Joining these parts the resulting string is "this is bar and another foo quux", which is exactly what we got from running the one-liner.

32. Substitute (find and replace) only the last occurrence of "foo" with "bar".

sed 's/\(.*\)foo/\1bar/'

This one-liner uses a capturing group that captures everything up to "foo". It replaces the captured group and "foo" with captured group itself (the \1 back-reference) and "bar". It results in the last occurrence of "foo" getting replaced with "bar".

33. Substitute all occurrences of "foo" with "bar" on all lines that contain "baz".

sed '/baz/s/foo/bar/g'

This one-liner uses a regular expression to restrict the substitution to lines matching "baz". The lines that do not match "baz" get simply printed out, but those that do match "baz" get the substitution applied.

34. Substitute all occurrences of "foo" with "bar" on all lines that DO NOT contain "baz".

sed '/baz/!s/foo/bar/g'

Sed commands can be inverted and applied to lines that DO NOT match a certain pattern. The exclamation mark "!" before a sed command does it. In this one-liner the substitution command is applied to the lines that DO NOT match "baz".

35. Change text "scarlet", "ruby" or "puce" to "red".

sed 's/scarlet/red/g;s/ruby/red/g;s/puce/red/g'

This one-liner just uses three consecutive substitution commands. The first replaces "scarlet" with "red", the second replaces "ruby" with "red" and the last one replaces "puce" with "red".

If you are using GNU sed, then you can do it simpler:

gsed 's/scarlet\|ruby\|puce/red/g'

GNU sed provides more advanced regular expressions which support alternation. This one-liner uses alternation and the substitute command reads "replace 'scarlet' OR 'ruby' OR 'puce' with 'red'".

36. Reverse order of lines (emulate "tac" Unix command).

sed '1!G;h;$!d'

This one-liner acts as the "tac" Unix utility. It's tricky to explain. The easiest way to explain it is by using an example.

Let's use a file with just 3 lines:

$ cat file
foo
bar
baz

Running this one-liner on this file produces the file in reverse order:

$ sed '1!G;h;$!d' file
baz
bar
foo

The first one-liner's command "1!G" gets applied to all the lines which are not the first line. The second command "h" gets applied to all lines. The third command "$!d" gets applied to all lines except the last one.

Let's go through the execution line by line.

Line 1: Only the "h" command gets applied for the first line "foo". It copies this line to the hold buffer. The hold buffer now contains "foo". Nothing gets output, as the "d" command gets applied.
Line 2: The "G" command gets applied. It appends the contents of the hold buffer to the pattern space. The pattern space now contains "bar\nfoo". The "h" command gets applied, it copies "bar\nfoo" to the hold buffer. It now contains "bar\nfoo". Nothing gets output.
Line 3: The "G" command gets applied. It appends the hold buffer to the third line. The pattern space now contains "baz\nbar\nfoo". As this was the last line, "d" does not get applied and the contents of the pattern space get printed. It's "baz\nbar\nfoo". The file got reversed.

If we had had more lines, they would simply have been appended to the hold buffer in reverse order.

Here is another way to do the same:

sed -n '1!G;h;$p'

It silences the output with "-n" switch and forces the output with "p" command only at the last line.

These two one-liners actually use a lot of memory because they keep the whole file in hold buffer in reverse order before printing it out. Avoid these one-liners for large files.

37. Reverse a line (emulates "rev" Unix command).

sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'

This is a very complicated one-liner. I had trouble understanding it the first time I saw it and ended up asking on comp.unix.shell for help.

Let's re-format this sed one-liner:

 sed '
   /\n/ !G
   s/\(.\)\(.*\n\)/&\2\1/
   //D
   s/.//
 ' 

The first line "/\n/ !G" appends a newline to the end of the pattern space if there was none.

The second line "s/\(.\)\(.*\n\)/&\2\1/" is a simple s/// expression which groups the first character as \1 and all the others as \2. Then it replaces the whole matched string with "&\2\1", where "&" is the whole matched text ("\1\2"). For example, if the input string is "1234" then after the s/// expression, it becomes "1234\n234\n1".

The third line is "//D". This statement is the key in this one-liner. An empty pattern // matches the last existing regex, so it's exactly the same as: /\(.\)\(.*\n\)/D. The "D" command deletes from the start of the pattern space up to the first newline and then resumes execution with the first command in the script. It creates a loop. As long as /\(.\)\(.*\n\)/ is satisfied, sed will repeat all the previous operations. After several loops, the text in the pattern space becomes "\n4321". Then /\(.\)\(.*\n\)/ fails and sed goes to the next command.

The fourth line "s/.//" removes the first character in the pattern space which is the newline char. The contents in pattern space becomes "4321" -- reverse of "1234".

There you have it, a line has been reversed.

38. Join pairs of lines side-by-side (emulates "paste" Unix command).

sed '$!N;s/\n/ /'

This one-liner joins two consecutive lines with the "N" command. They get joined with a "\n" character between them. The substitute command replaces this newline with a space, thus joining every pair of lines with a whitespace.

39. Append a line to the next if it ends with a backslash "\".

sed -e :a -e '/\\$/N; s/\\\n//; ta'

The first expression ':a' creates a named label "a". The second expression looks to see if the current line ends with a backslash "\". If it does, it joins it with the line following it using the "N" command. Then the backslash and the newline between the joined lines get erased with the "s/\\\n//" command. If the substitution was successful, we branch to the beginning of the expression and do the same again, in the hope that we might have another backslash. If the substitution was not successful, the line did not end with a backslash and we print it out.

Here is an example of running this one-liner:

$ cat filename
line one \
line two
line three
$ sed -e :a -e '/\\$/N; s/\\\n//; ta' filename
line one line two
line three

Lines one and two got joined because the first line ended with backslash.

40. Append a line to the previous if it starts with an equal sign "=".

sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D'

This one-liner also starts with creating a named label "a". Then it tests to see if it is not the last line and appends the next line to the current one with the "N" command. If the just-appended line starts with a "=", the one-liner branches to the label "a" to see if there are more lines starting with "=". During this process a substitution gets executed which throws away the newline character that came from joining with "N", together with the "=". If the substitution fails, the one-liner prints out the pattern space up to the newline character with the "P" command, deletes the contents of the pattern space up to the newline character with the "D" command, and repeats the process.

Here is an example of running it:

$ cat filename
line one
=line two
=line three
line four
$ sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D' filename
line one line two line three
line four

Lines one, two and three got joined, because lines two and three started with '='. Line four got printed as-is.

41. Digit group (commify) a numeric string.

sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta'

This one-liner turns a string of digits, such as "1234567" to "1,234,567". This is called commifying or digit grouping.

First the one-liner creates a named label "a". Then it captures two groups of digits. The first group is all the digits up to the last three. The last three digits get captured in the 2nd group. Then the two matching groups get separated by a comma. Then the same rules get applied to the line again and again until all the numbers have been grouped in groups of three.

The substitution "\1,\2" joins the contents of group one and group two with a comma between them.

Here is an example to understand the grouping happening here better. Suppose you have the numeric string "1234567". The first group captures all the numbers until the last three, "1234". The second group captures the last three numbers, "567". They get joined by a comma. Now the string is "1234,567". The same rules get applied to the string again. The number "1" gets captured in the first group and the numbers "234" in the second. The number string is now "1,234,567". Trying to apply the same rules again fails because there is just one digit at the beginning of the string, so the string gets printed out and sed moves on to the next line.
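
Here is the one-liner in action:

$ echo "1234567" | sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta'
1,234,567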

If you have GNU sed, you can use a simpler one-liner:

gsed ':a;s/\B[0-9]\{3\}\>/,&/;ta'

This one-liner starts with creating a named label "a" and then loops over the string the same way as the previous one-liner did. The only difference is how groups of three digits get matched. GNU sed has some additional patterns. There are two patterns that make this one-liner work. The first is "\B", which matches anywhere except at a word boundary. It's needed so that the three digits we match are preceded by another digit rather than by the start of a word. Look at this example:

$ echo "12345 1234 123" | sed 's/[0-9]\{3\}\>/,&/g'
12,345 1,234 ,123

It's clearly wrong. The last 123 got a comma added. Adding the "\B" makes sure we match the numbers only at word boundary:

$ echo "12345 1234 123" | sed 's/\B[0-9]\{3\}\>/,&/g'
12,345 1,234 123

The second is "\>". It matches the null string at the end of a word. It's necessary because we need to match the right-most three digits. If we did not have it, the expression would match right after the first digit.

42. Add commas to numbers with decimal points and minus signs.

gsed -r ':a;s/(^|[^0-9.])([0-9]+)([0-9]{3})/\1\2,\3/g;ta'

This one-liner works in GNU sed only. It turns on extended regular expression support with the "-r" switch. Then it loops over a line matching three groups and separates the first two from the third with a comma.

The first group makes sure we ignore a leading non-digit character, such as + or -. If there is no leading non-digit character, then it just anchors at the beginning of the string which always matches.

The second group matches a bunch of numbers. The third group makes sure the second group does not match too many. It matches 3 consecutive numbers at the end of the string.

Once the groups have been captured, the "\1\2,\3" substitution is done and the expression is looped again, until the whole string has been commified.

43. Add a blank line after every five lines.

sed 'n;n;n;n;G;'

The "n" command is called four times in this one-liner. Each time it's called it prints out the current pattern space, empties it and reads in the next line of input. After calling it four times, the fifth line is read into the pattern space and then the "G" command gets called. The "G" command appends a newline to the fifth line. Then the next round of four "n" commands is done. Next time the first "n" command is called it prints out the newlined fifth line, thus inserting a blank line after every 5 lines.

The same can be achieved with GNU sed's step extension:

gsed '0~5G'

GNU sed's step extensions can be generalized as "first~step". It matches every "step"'th line starting with line "first". In this one-liner it matches every 5th line starting with line 0.

Sed One-Liners Explained E-Book

I have written an e-book called "Sed One-Liners Explained". I improved the explanations of the one-liners in this article series, added new one-liners and added three new chapters - an introduction to sed, a summary of sed addresses and ranges, and debugging sed scripts with sed-sed. Please take a look:

Have fun!

Have fun with sed, the superman of unix stream editing. The second part of this article will be on "Selective printing of certain lines." I hope to see you on my blog again for the 2nd part of the article! :)

This article is part of the article series "MIT Introduction to Algorithms."

MIT Algorithms

This is the seventh post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lecture eleven, which is on the topic of Augmenting Data Structures.

There are some programming situations that can be perfectly solved with standard data structures such as linked lists, hash tables, or binary search trees. Many others require a dash of creativity. Only in rare situations will you need to create an entirely new type of data structure, though. More often, it will suffice to augment (to modify) an existing data structure by storing additional information in it. You can then program new operations for the data structure to support the desired application. Augmenting a data structure is not always straightforward, however, since the added information must be updated and maintained by the ordinary operations on the data structure.

This lecture discusses two data structures that are constructed by augmenting red-black trees (see the previous post on red-black trees).

The first part of the lecture describes a data structure that supports general order-statistic operations on a dynamic set. It's called dynamic order statistics. The notion of order statistics was introduced in lecture six, where it was shown that any order statistic could be retrieved in O(n) time from an unordered set. In this lecture it is shown how red-black trees can be modified so that any order statistic can be determined in O(lg(n)) time. It presents two algorithms: OS-Select(i), which returns the i-th smallest item in a dynamic set, and OS-Rank(x), which returns the rank (position) of element x in sorted order.

The lecture continues with general methodology of how to augment a data structure. Augmenting a data structure can be broken into four steps:

  • 1. Choosing an underlying data structure,
  • 2. Determining additional information to be maintained in the underlying data structure,
  • 3. Verifying that the additional information can be maintained for the basic modifying operations (insert, delete, rotate, etc.) on the underlying data structure, and
  • 4. Developing new operations.

The second part of the lecture applies this methodology to construct a data structure called interval trees. This data structure maintains a dynamic set of elements, with each element x containing an interval. An interval is simply a pair of numbers (low, high). For example, a time interval from 3 o'clock to 7 o'clock is the pair (3, 7).

The lecture gives an algorithm called Interval-Search(x), which, given a query interval x, quickly finds an interval in the set that overlaps it. The time complexity of this algorithm is O(lg(n)).

The lecture ends with the correctness proof of Interval-Search(x) algorithm.

You're welcome to watch lecture eleven:

Topics covered in lecture eleven:

  • [00:20] Concept of augmenting data structures.
  • [02:00] Dynamic order statistics.
  • [02:20] OS-Select operation on dynamic order statistics.
  • [02:50] OS-Rank operation on dynamic order statistics.
  • [03:49] Dynamic order statistics key idea - keep the sizes of subtrees in nodes of a red-black tree.
  • [04:10] Example of a tree representing dynamic order statistic.
  • [10:10] OS-Select algorithm.
  • [16:40] Analysis of OS-Select.
  • [17:30] OS-Rank algorithm.
  • [20:15] Modifying operations of dynamic order statistics tree.
  • [22:55] Example of inserting an element into the tree.
  • [26:11] Example of rotating a tree.
  • [29:30] Methodology of data structure augmentation.
  • [36:45] Data structure augmentation applied to construct interval trees.
  • [37:31] Example of time-intervals.
  • [39:48] Query operation on interval trees - find an interval in the set that overlaps a given query interval.
  • [41:15] Step 1, underlying data structure: red-black tree keyed on low endpoint.
  • [45:10] Step 2, additional node information: largest value in the subtree rooted at that node.
  • [50:24] Step 3, modifying ops: insert, delete.
  • [56:55] Step 4, new ops: Interval-Search.
  • [01:00:00] Example of Interval-Search algorithm.
  • [01:06:30] Running time of Interval-Search -- O(lg(n)).
  • [01:07:20] List all overlaps (k of them) in O(k*lg(n)) time.
  • [01:08:50] Best algorithm to find all overlaps to date is O(k + lg(n)).
  • [01:09:11] Correctness proof of Interval-Search algorithm.

Lecture eleven notes:

Lecture 11, page 1 of 2.

Lecture 11, page 2 of 2.

Have fun augmenting data structures! The next post will be about a simple and efficient search structure called skip list!

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms"):

This article is part of the article series "Awk One-Liners Explained."

awk programming one-liners explained

I noticed that Eric Wendelin wrote an article "awk is a beautiful tool." In this article he said that it was best to introduce Awk with practical examples. I totally agree with Eric.

When I was learning Awk, I first went through Awk - A Tutorial and Introduction by Bruce Barnett, which was full of examples to try out; then I created an Awk cheat sheet to have the language reference in front of me; and finally I went through the famous Awk one-liners (link to .txt file), which were compiled by Eric Pement.

This is going to be a three-part article in which I will explain every single one-liner in Mr. Pement's compilation. Each part will explain around 20 one-liners. If you follow closely then the explained examples will turn you into a great Awk programmer.

Eric Pement's Awk one-liner collection consists of five sections:

  • File spacing.
  • Numbering and calculations.
  • Text conversion and substitution.
  • Selective printing of certain lines.
  • Selective deleting of certain lines.

The first part of the article will explain the first two sections: "File spacing" and "Numbering and calculations." The second part will explain "Text conversion and substitution", and the last part "Selective printing/deleting of certain lines."

I recommend that you print out my Awk cheat sheet before you proceed. This way you will have the language reference in front of you, and you will memorize things better.

These one-liners work with all versions of awk, such as nawk (AT&T's new awk), gawk (GNU's awk), mawk (Michael Brennan's awk) and oawk (old awk).

Awesome news: I have written an e-book based on this article series. Check it out:

Let's start!

1. Line Spacing

1. Double-space a file.

awk '1; { print "" }'

So how does it work? A one-liner is an Awk program and every Awk program consists of a sequence of pattern-action statements "pattern { action statements }". In this case there are two statements "1" and "{ print "" }". In a pattern-action statement either the pattern or the action may be missing. If the pattern is missing, the action is applied to every single line of input. A missing action is equivalent to '{ print }'. Thus, this one-liner translates to:

awk '1 { print } { print "" }'

An action is applied only if the pattern matches, i.e., pattern is true. Since '1' is always true, this one-liner translates further into two print statements:

awk '{ print } { print "" }'

Every print statement in Awk is silently followed by an ORS - Output Record Separator variable, which is a newline by default. The first print statement with no arguments is equivalent to "print $0", where $0 is a variable holding the entire line. The second print statement prints nothing, but knowing that each print statement is followed by ORS, it actually prints a newline. So there we have it, each line gets double-spaced.
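
A quick test run:

$ printf 'line 1\nline 2\n' | awk '1; { print "" }'
line 1

line 2
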

2. Another way to double-space a file.

awk 'BEGIN { ORS="\n\n" }; 1'

BEGIN is a special kind of pattern which is not tested against the input. It is executed before any input is read. This one-liner double-spaces the file by setting the ORS variable to two newlines. As I mentioned previously, statement "1" gets translated to "{ print }", and every print statement gets terminated with the value of ORS variable.

3. Double-space a file so that no more than one blank line appears between lines of text.

awk 'NF { print $0 "\n" }'

The one-liner uses another special variable called NF - Number of Fields. It contains the number of fields the current line was split into. For example, a line "this is a test" splits into four pieces and NF gets set to 4. The empty line "" does not split into any pieces and NF gets set to 0. Using NF as a pattern can effectively filter out empty lines. This one-liner says: "If there are any number of fields, print the whole line followed by a newline."

4. Triple-space a file.

awk '1; { print "\n" }'

This one-liner is very similar to previous ones. '1' gets translated into '{ print }' and the resulting Awk program is:

awk '{ print; print "\n" }'

It prints the line, then prints a newline followed by terminating ORS, which is newline by default.

2. Numbering and Calculations

5. Number lines in each file separately.

awk '{ print FNR "\t" $0 }'

This Awk program prepends the FNR predefined variable - the line number in the current file - and a tab (\t) to each line. The FNR variable contains the current line number for each file separately. For example, if this one-liner was called on two files, one containing 10 lines and the other 12, it would number the lines in the first file from 1 to 10, and then resume numbering from one for the second file and number its lines from 1 to 12. FNR gets reset from file to file.
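Here is a quick illustration with two made-up files (the names file1 and file2 are just for this example):

$ printf 'a\nb\n' > file1
$ printf 'c\n' > file2
$ awk '{ print FNR "\t" $0 }' file1 file2
1	a
2	b
1	c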

6. Number lines for all files together.

awk '{ print NR "\t" $0 }'

This one works the same as #5 except that it uses NR - the Line Number variable, which does not get reset from file to file. It counts all the input lines seen so far. For example, if it was called on the same two files with 10 and 12 lines, it would number the lines from 1 to 22 (10 + 12).
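Running it on the same two made-up files from the previous example shows the difference:

$ awk '{ print NR "\t" $0 }' file1 file2
1	a
2	b
3	c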

7. Number lines in a fancy manner.

awk '{ printf("%5d : %s\n", NR, $0) }'

This one-liner uses the printf() function to number lines in a custom format. It takes a format parameter just like the C printf() function. Note that ORS does not get appended at the end of printf(), so we have to print the newline (\n) character explicitly. This one right-aligns the line number in a five-character-wide field, followed by a space, a colon, a space and the line itself.
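For example, on a made-up two-line input:

$ printf 'alpha\nbeta\n' | awk '{ printf("%5d : %s\n", NR, $0) }'
    1 : alpha
    2 : beta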

8. Number only non-blank lines in files.

awk 'NF { $0=++a " :" $0 }; { print }'

Awk variables are dynamic; they come into existence when they are first used. This one-liner pre-increments variable 'a' each time the line is non-empty, then it prepends the value of this variable to the beginning of the line and prints it out.
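A quick check on a made-up input with a blank line in the middle. Note that the output looks like "1 :a" - that is exactly what the concatenation '++a " :" $0' produces:

$ printf 'a\n\nb\n' | awk 'NF { $0=++a " :" $0 }; { print }'
1 :a

2 :b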

9. Count lines in files (emulates wc -l).

awk 'END { print NR }'

END is another special kind of pattern which is not tested against the input. It is executed when all the input has been exhausted. This one-liner outputs the value of the NR special variable after all the input has been consumed. NR contains the total number of lines seen (= the number of lines in the file).

10. Print the sum of fields in every line.

awk '{ s = 0; for (i = 1; i <= NF; i++) s = s+$i; print s }'

Awk has some features of the C language, like the for (;;) { ... } loop. This one-liner loops over all fields in a line (there are NF fields in a line), accumulating their sum in variable 's'. Then it prints the result out and proceeds to the next line.
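For example:

$ echo '1 2 3' | awk '{ s = 0; for (i = 1; i <= NF; i++) s = s+$i; print s }'
6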

11. Print the sum of fields in all lines.

awk '{ for (i = 1; i <= NF; i++) s = s+$i }; END { print s+0 }'

This one-liner is basically the same as #10, except that it keeps a single running sum across all lines and prints it once at the end. Notice how it does not initialize variable 's' to 0 - that is not necessary, as variables come into existence dynamically. Also notice how it calls "print s+0" and not just "print s". That is necessary if the input has no fields at all: in that case "s" never comes into existence and is undefined, and printing an undefined value prints nothing (i.e., prints just the ORS). Adding a 0 does a mathematical operation and undef+0 = 0, so it prints "0".
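For example, on two made-up lines of numbers:

$ printf '1 2\n3 4\n' | awk '{ for (i = 1; i <= NF; i++) s = s+$i }; END { print s+0 }'
10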

12. Replace every field by its absolute value.

awk '{ for (i = 1; i <= NF; i++) if ($i < 0) $i = -$i; print }'

This one-liner uses two other features of the C language, namely the if (...) { ... } statement and omission of curly braces. It loops over all fields in a line and checks if any of the fields is less than 0. If a field is less than 0, it gets negated to make it positive. Fields can be addressed indirectly by a variable. For example, i = 5; $i = "hello" sets field number 5 to the string "hello".

Here is the same one-liner rewritten with curly braces for clarity. The 'print' statement gets executed after all the fields in the line have been replaced by their absolute values.

awk '{
  for (i = 1; i <= NF; i++) {
    if ($i < 0) {
      $i = -$i;
    }
  }
  print
}'
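A quick check on made-up input:

$ echo '-1 2 -3' | awk '{ for (i = 1; i <= NF; i++) if ($i < 0) $i = -$i; print }'
1 2 3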

13. Count the total number of fields (words) in a file.

awk '{ total = total + NF }; END { print total+0 }'

This one-liner matches all the lines and keeps adding the number of fields in each line. The number of fields seen so far is kept in a variable named 'total'. Once the input has been processed, the special pattern 'END { ... }' is executed, which prints the total number of fields. See one-liner #11 for an explanation of why we "print total+0" in the END block.

14. Print the total number of lines containing word "Beth".

awk '/Beth/ { n++ }; END { print n+0 }'

This one-liner has two pattern-action statements. The first one is '/Beth/ { n++ }'. A pattern between two slashes is a regular expression. It matches all lines containing pattern "Beth" (not necessarily the word "Beth", it could as well be "Bethe" or "theBeth333"). When a line matches, variable 'n' gets incremented by one. The second pattern-action statement is 'END { print n+0 }'. It is executed when the file has been processed. Note the '+0' in 'print n+0' statement. It forces '0' to be printed in case there were no matches ('n' was undefined). Had we not put '+0' there, an empty line would have been printed.
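A quick check on made-up input - notice that "Bethe" is counted, too:

$ printf 'Beth\nBethe\nJohn\n' | awk '/Beth/ { n++ }; END { print n+0 }'
2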

15. Find the line containing the largest (numeric) first field.

awk '$1 > max { max=$1; maxline=$0 }; END { print max, maxline }'

This one-liner keeps track of the largest number in the first field (in variable 'max') and the corresponding line (in variable 'maxline'). Once it has looped over all lines, it prints them out. Warning: this one-liner does not work if all the values are negative, because 'max' starts out uninitialized and compares as 0, so no negative first field is ever greater than it.

Here is the fix:

awk 'NR == 1 { max = $1; maxline = $0; next; } $1 > max { max=$1; maxline=$0 }; END { print max, maxline }'
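A quick check of the fixed version on made-up, all-negative input:

$ printf '%s\n' -5 -2 -9 | awk 'NR == 1 { max = $1; maxline = $0; next; } $1 > max { max=$1; maxline=$0 }; END { print max, maxline }'
-2 -2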

16. Print the number of fields in each line, followed by the line.

awk '{ print NF ":" $0 }'

This one-liner just prints out the predefined variable NF - Number of Fields, which contains the number of fields in the line, followed by a colon and the line itself.

17. Print the last field of each line.

awk '{ print $NF }'

Fields in Awk need not be referenced by constants. For example, code like 'f = 3; print $f' would print out the 3rd field. This one-liner prints the field whose number is the value of NF, i.e., $NF - the last field on the line.
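For example:

$ echo 'this is a test' | awk '{ print $NF }'
test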

18. Print the last field of the last line.

awk '{ field = $NF }; END { print field }'

This one-liner keeps track of the last field in variable 'field'. Once it has looped over all the lines, variable 'field' contains the last field of the last line, and it just prints it out.

Here is a better version of the same one-liner. It's more common, idiomatic and efficient:

awk 'END { print $NF }'

19. Print every line with more than 4 fields.

awk 'NF > 4'

This one-liner omits the action statement. As I noted in one-liner #1, a missing action statement is equivalent to '{ print }'.

20. Print every line where the value of the last field is greater than 4.

awk '$NF > 4'

This one-liner is similar to #19, except that the pattern tests the last field, referenced via the NF variable as in #17. If the last field is greater than 4, the line gets printed.

Awk one-liners explained e-book

I have written my first e-book called "Awk One-Liners Explained". I improved the explanations of the one-liners in this article series, added new one-liners and added three new chapters - an introduction to awk one-liners, a summary of awk special variables and idiomatic awk. Please take a look.

That's it for Part I of the article. The second part will cover "Text conversion and substitution."

Have fun learning Awk! It's a fun language to know. :)

This article is part of the article series "MIT Introduction to Algorithms."

This is the sixth post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lectures nine and ten, which are on the topic of Search Trees.

Search tree data structures provide many dynamic-set operations such as search, insert, delete, minimum element, maximum element and others. The complexity of these operations is proportional to the height of the tree. The better we can balance the tree, the faster they will be. Lectures nine and ten discuss two such structures.

Lecture nine discusses randomly built binary search trees. A randomly built binary search tree is a binary tree that arises from inserting the keys in random order into an initially empty tree. The key result shown in this lecture is that the height of this tree is O(lg(n)).

Lecture ten discusses red-black trees. A red-black tree is a binary search tree with an extra bit of information at each node -- its color, which can be either red or black. By constraining the way nodes can be colored on any path from the root to a leaf, red-black trees ensure that the tree is balanced, giving us a guarantee that the operations on this tree will run in O(lg(n)) time!

PS. Sorry for being silent for the past two weeks. I am preparing for job interviews at a company starting with 'G' and it is taking all my time. ;)

Lecture 9: Randomly Built Binary Search Trees

Lecture nine starts with an example of a good and a bad binary search tree. Given a binary tree with n nodes, a good tree has height around lg(n), but a bad one has height close to n. As the basic operations on trees run in time proportional to the height of the tree, it's recommended that we build the good trees and not the bad ones.

Before discussing randomly built binary search trees, professor Erik Demaine shows another sorting algorithm. It's called binary search tree sort (BST-sort). It's amazingly simple -- given an array of n items to sort, build a BST out of it and do an in-order tree walk on it. An in-order tree walk walks the left branch first, then prints the node's value, and then walks the right branch. Can you see why the printed list of values is sorted? (If not, see the lecture ;) ) [part three of the article series covers sorting algorithms]
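The lecture presents BST-sort in pseudocode; just to make the idea concrete, here is a minimal sketch of my own in awk (to stay with the language of the previous article), not the lecture's code. It stores the tree in the key[], left[] and right[] arrays, with nodes as array indexes, and assumes whitespace-separated numbers on one input line:

$ echo '3 1 8 2 6 7 5' | awk '
function insert(node, v) {     # insert v, return the root of the subtree
    if (node == 0) { key[++n] = v; return n }
    if (v + 0 < key[node] + 0) left[node] = insert(left[node], v)
    else right[node] = insert(right[node], v)
    return node
}
function walk(node) {          # in-order walk prints the keys in sorted order
    if (node == 0) return
    walk(left[node]); print key[node]; walk(right[node])
}
{ for (i = 1; i <= NF; i++) root = insert(root, $i) }
END { walk(root) }'
1
2
3
5
6
7
8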

It turns out that there is a relation between BST-sort and the quicksort algorithm: they make the same comparisons, but in a different order. [more info on quicksort in part two of the article series and in the "three beautiful quicksorts" post]

After this discussion, the lecture continues with randomized BST-sort, which leads to the idea of randomly built BSTs.

The other half of the lecture is devoted to a complicated proof of the expected height of a randomly built binary search tree. The result of this proof is that the expected height is O(lg(n)).

You're welcome to watch lecture nine:

Topics covered in lecture nine:

  • [00:50] Good and bad binary search trees (BSTs).
  • [02:00] Binary search tree sort tree algorithm.
  • [03:45] Example of running BST-sort on array (3, 1, 8, 2, 6, 7, 5).
  • [05:45] Running time analysis of BST-sort algorithm.
  • [11:45] BST-sort relation to quicksort algorithm.
  • [16:05] Randomized BST-sort.
  • [19:00] Randomly built binary search trees.
  • [24:58] Theorem: the expected height of a randomly built BST is O(lg(n)).
  • [26:45] Proof outline.
  • [32:45] Definition of convex function.
  • [46:55] Jensen's inequality.
  • [55:55] Expected random BST height analysis.

Lecture nine notes:

Lecture 9, page 1 of 2.
Lecture 9, page 2 of 2.

Lecture 10: Red-Black Trees

Lecture ten begins with a discussion of balanced search trees. A balanced search tree is a search tree data structure that maintains a dynamic set of n elements using a tree of height O(lg(n)).

There are many balanced search tree data structures, for example: AVL trees (invented in 1962), 2-3 trees (invented in 1970), 2-3-4 trees, B-trees, red-black trees, skip lists and treaps.

This lecture focuses exclusively on red-black trees.

Red-black trees are binary search trees with an extra color field for each node. They satisfy the red-black properties:

  • Every node is either red or black.
  • The root and leaves are black.
  • Every red node has a black parent.
  • All simple paths from a node x to a descendant leaf of x have the same number of black nodes, called black-height(x).

The lecture gives a proof sketch of the height of an RB-tree and discusses the running time of queries (the search, min, max, successor and predecessor operations), and then goes into the details of the update operations (insert and delete). Along the way, rotations on a tree are defined: the right-rotate and left-rotate operations.
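To make the rotation idea concrete, here is a bare-bones left-rotate sketch of my own (again in awk, in the same style as the BST-sort sketch above), not the lecture's pseudocode -- the lecture's version also updates parent pointers, which I omit here; the caller re-attaches the returned node where x used to be:

function left_rotate(x,   y) {
    y = right[x]         # y, the right child of x, becomes the subtree root
    right[x] = left[y]   # y's left subtree becomes x's right subtree
    left[y] = x          # x becomes y's left child
    return y             # caller links y into x's old position
}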

The other half of the lecture looks at Red-Black-Insert operation that inserts an element in the tree while maintaining the red-black properties.

Here is the video of lecture ten:

Topics covered in lecture ten:

  • [00:35] Balanced search trees.
  • [02:30] Examples of balanced search tree data structures.
  • [05:16] Red-black trees.
  • [06:11] Red-black properties.
  • [11:26] Example of red-black tree.
  • [17:30] Height of red-black tree.
  • [18:50] Proof sketch of RB-tree height.
  • [21:30] Connection of red-black trees to 2-3-4 trees.
  • [32:10] Running time of query operations.
  • [35:37] How to do RB-tree updates (inserts, deletes)?
  • [36:30] Tree rotations.
  • [40:55] Idea of red-black tree insert operation.
  • [44:30] Example of inserting an element in a tree.
  • [54:30] RB-Insert algorithm.
  • [01:03:35] The three cases in insert operation.

Lecture ten notes:

Lecture 10, page 1 of 2.
Lecture 10, page 2 of 2.

Have fun building trees! The next post will be about the general methodology of augmenting data structures, and it will discuss dynamic order statistics and interval trees!

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms").