Set Operations in the Unix Shell

A while ago I wrote about how I solved the Google Treasure Hunt Puzzle #4 about prime numbers. I took an unusual approach and solved this problem entirely from the Unix shell. The solution involved finding the intersection between a bunch of files containing numbers. This lead me to an idea to write a post about how to do various set operations from the shell by using common utilities such as sort, uniq, diff, grep, head, tail, comm, and others.

I'll cover the following set operations in this article.

Set Membership. Test if an element belongs to a set.
Set Equality. Test if two sets contain the same elements.
Set Cardinality. Return the number of elements in the set.
Subset Test. Test if a given set is a subset of another set.
Set Union. Find union of two sets.
Set Intersection. Find intersection of two sets.
Set Complement. Given two sets A and B, find all elements in A that are not in B.
Set Symmetric Difference. Find symmetric difference of two sets.
Power Set. Generate all subsets of a set.
Set Cartesian Product. Find A x B.
Disjoint Set Test. Test if two sets are disjoint.
Empty Set Test. Test if a given set is empty.
Minimum. Find the smallest element of a set.
Maximum. Find the largest element of a set.

Update: I wrote another post about these operations and created a cheat sheet.

Download cheat sheet: set operations in unix shell (.txt)

To illustrate these operations, I created a couple of random sets to work with. Each set is represented as a file with one element per line. The elements are positive numbers.

First, I created two sets A and B with 5 elements each so that I could easily check if the operations really worked.

Sets A and B are hand-crafted. It's easy to see that only elements 1, 2 and 3 are in common:

$ cat A     $ cat B
3           11
5           1
1           12
2           3
4           2

I also created a set Asub which is a subset of set A and Anotsub which is not a subset of A (to test the Subset Test operation):

$ cat Asub     $ cat Anotsub
3              6
2              7
5              8

Next, I created two equal sets Aequal and Bequal again with 5 elements each:

$ cat Aequal     $ cat Bequal
103              100
102              101
101              102
104              103
100              104

Then, I created two huge sets Abig and Bbig with 100,000 elements with repeated elements.

The easiest way to generate sets Abig and Bbig was to take natural numbers from /dev/urandom. There are two shell commands that can easily do that. The first is od and the second is hexdump.

Here is how to create two files with 100,000 natural numbers with both commands.

With hexdump:

$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Abig
$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Bbig

The -e switch specifies the output format. It says - take 1 element of size 4 bytes and output it as an unsigned integer. The -n switch specifies how many bytes to read from /dev/urandom. In this case it's 400000 (400000 bytes / 4 bytes per element = 100000 elements).

With od:

$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Abig
$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Bbig

The "-An" switch specifies that no line address is necessary. The "-w4" switch specifies number of bytes to output per line. The "-tu4" says to output unsigned 4-byte numbers and "-N400000" limits the output to 400000 bytes (400000/4 = 100000 elements). The output from od has to be filtered through sed to drop the leading whitespace characters.

Okay, now let's look at various set operations.

Set Membership

The set membership operation tests if an element belongs to a set. We write a ? A, if element a belongs to set A, and we write a ? A, if it does not.

The easiest way to test if an element is in a set is to use "grep" command. Grep searches the file for lines matching a pattern:

$ grep -xc 'element' set

The "-c" flag outputs number of elements in the set. If it is not a multi-set, the number of elements should be 0 or 1. The "-x" option specifies to match the whole line only (no partial matches).

Here is an example of this operation run on set A:

$ grep -xc '4' A
1
$ grep -xc '999' A
0

That's correct. Set A contains element 4 but does not contain element 999.

If the membership operation has to be used from a shell script, the return code from grep can be used instead. Unix commands succeed if the return code is 0, and fail otherwise:

$ grep -xq 'element' set
# returns 0 if element ? set
# returns 1 if element ? set

The "-q" flag makes sure that grep does not output the element if it is in the set.

Set Equality

The set equality operation tests if two sets are the same, i.e., contain the same elements. We write A = B if sets A and B are equal and A ? B if they are not.

The easiest way to test if two sets are equal is to use "diff" command. Diff command compares two files for differences. It will find that the order of lines differ, so the files have to be sorted first. If they are multi-sets, the output of sort has to be run through "uniq" command to eliminate duplicate elements:

$ diff -q <(sort set1 | uniq) <(sort set2 | uniq)
# returns 0 if set1 = set2
# returns 1 if set1 ? set2

The "-q" flag quiets the output of diff command.

Let's test this operation on sets A, B, Aequal and Bequal:

$ diff -q <(sort A | uniq) <(sort B | uniq)
# return code 1 -- sets A and B are not equal

$ diff -q <(sort Aequal | uniq) <(sort Bequal | uniq)
# return code 0 -- sets A and B are equal

If you have already sorted sets, then just run:

$ diff -q set1 set2

Set Cardinality

The set cardinality operations returns the number of elements in the set. We write |A| to denote the cardinality of the set A.

The simplest way to count the number of elements in a set is to use "wc" command. Wc command counts the number of characters, words or lines in a file. Since each element in the set appears on a new line, counting the number of lines in the file will return the cardinality of the set:

$ wc -l set | cut -d' ' -f1

Cut command is necessary because "wc -l" also outputs the name of the file it was ran on. The cut command outputs the first field which is number of lines in the file.

We can actually get rid of cut:

$ wc -l < set

Let's test if on sets A and Abig:

$ wc -l A | cut -d' ' -f1
5

$ wc -l Abig | cut -d' ' -f1
100000

$ wc -l < A
5

$ wc -l < Abig 
100000

Subset Test

The subset test tests if the given set is a subset of another set. We write S ? A if S is a subset of A, and S ? A, if it's not.

I found a very easy way to do it using the "comm" utility. Comm compares two sorted files line by line. It may be run in such a way that it outputs lines that appear only in the first specified file. If the first file is subset of the second, then all the lines in the 1st file also appear in the 2nd, so no output is produced:

$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
# comm returns no output if subset ? set
# comm outputs something if subset ? set

Please remember that if you have a numeric set, then sort must take "-n" option.

Let's test if Asub is a subset of A:

$ comm -23 <(sort -n Asub|uniq) <(sort -n A|uniq) | head -1
# no output - yes, Asub ? A

Now let's test if Anotsub is a subset of A:

$ comm -23 <(sort -n Anotsub|uniq) <(sort -n A|uniq) | head -1
6 # has output - no, Anotsub ? A

If you want to use it from a shell script, you'd have to test if the output from this command was empty or not.

Set Union

The set union operation unions two sets, i.e., join them into one set. We write C = A ? B to denote union of sets A and B which produces set C.

Set union is extremely easy to create. Just use the "cat" utility to concatenate two files:

$ cat set1 set2

If the duplicates (elements which are both in set1 and set2) are not welcome, then the output of cat can be filtered via awk:

 
$ cat set1 set2 | awk '!found[$1]++'

# we can also get rid of cat by just using awk:

$ awk '!found[$1]++' set1 set2

If we don't want to use awk, which is a whole-blown programming language, then we can sort the output of cat and filter it via uniq:

$ cat set1 set2 | sort | uniq

# we can get rid of cat by specifying arguments to sort:

$ sort set1 set2 | uniq

# finally we can get rid of uniq by specifying -u flag to sort

$ sort -u set1 set2

If the sets set1 and set2 are already sorted, then the union operation can be made much faster by specifying the "-m" command line option, which merges the files (like the final step of merge-sort algorithm):

$ sort -m set1 set2 | uniq

# or

$ set -um set1 set2

Let's test this operation on sets A and B:

$ cat A B # with duplicates
3
5
1
2
4
11
1
12
3
2

$ awk '!found[$1]++' # without dupes
3
5
1
2
4
11
12

$ sort -n A B | uniq # with sort && uniq
1
2
3
4
5
11
12

Set Intersection

The set intersection operation finds elements that are in both sets at the same time. We write C = A ? B to denote the intersection of sets A and B, which produces the set C.

There are many ways to do set intersection. The first way that I am going to show you uses "comm":

$ comm -12 <(sort set1) <(sort set2)

The "-12" option to comm directs it to suppress output of lines appearing just in the 1st and the 2nd file and makes it output lines appearing in both 1st and 2nd, which is the intersection of two sets.

Please remember that if you have a numeric set, then sort must take "-n" option.

Another way to do it is to use "grep" utility. I actually found about this method as I was writing this article:

$ grep -xF -f set1 set2

The "-x" option forces grep to match the whole lines (no partial matches). The "-f set1" specifies the patterns to use for searching. The "-F" option makes grep interpret the given patterns literally (no regexes). It works by matching all lines of set1 in set2. The lines that appear just in set1 or just in set2 are never output.

The next way to find intersection is by using "sort" and "uniq":

$ sort set1 set2 | uniq -d

The "-d" option to uniq forces it to print only the duplicate lines. Obviously, if a line appears in set1 and set2, after sorting there will be two consecutive equal lines in the output. The "uniq -d" command prints such repeated lines (but only 1 copy of it), thus it's the intersection operation.

Just a few minutes before publishing this article I found another way to do intersection with "join" command. Join command joins files on a common field:

$ join <(sort -n A) <(sort -n B)

Here is a test run:

$ sort -n A B | uniq -d
1
2
3

$ grep -xF -f A B
1
3
2

$ comm -12 <(sort -n A) <(sort -n B)
1
2
3

Set Complement

The set complement operation finds elements that are in one set but not the other. We write A - B or A \ B to denote set's B complement in set A.

Comm has become a pretty useful command for operating on sets. It can be applied to implement set complement operation as well:

$ comm -23 <(sort set1) <(sort set2)

The option "-23" specifies that comm should not print elements that appear just in set2 and that are common to both. It leaves comm to print elements which are just in set1 (and not in set2).

The "grep" command can also be used to implement this operation:

$ grep -vxF -f set2 set1

Notice that the order of sets has been reversed from that of comm. That's because we are searching those elements in set1, which are not in set2.

Another way to do it is, of course, with "sort" and "uniq":

$ sort set2 set2 set1 | uniq -u

This is a pretty tricky command. Suppose that a line appears in set1 but does not appear in set2. Then it will be output just once and will not get removed by uniq. All other lines get removed.

Let's put these commands to test:

$ comm -23 <(sort -n A) <(sort -n B)
4
5

$ grep -vxF -f B A
5
4

$ sort -n B B A | uniq -u
4
5

Set Symmetric Difference

The set symmetric difference operation finds elements that are in one set, or in the other but not both. We write A ? B to denote symmetric difference of sets A and B.

The operation can be implemented very easily with "comm" utility:

$ comm -3 <(sort set1) <(sort set2) | sed 's/\t//g'

# sed can be replaced with tr

$ comm -3 <(sort set1) <(sort set2) | tr -d '\t'

Here comm is instructed via "-3" not to output fields that are common to both files, but to output fields that are just in set1 and just in set2. Sed is necessary because comm outputs two columns of data and some of it is right padded with a \t tab character.

It can also be done with "sort" and "uniq":

$ sort set1 set2 | uniq -u

We can use mathematics and derive a few formulas involving previously used operations for symmetric difference: A ? B = (A - B) ? (B - A). Now we can use grep:

$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)
# does (B - A) ? (A - B)

# this can be simplified

$ grep -vxF -f set1 set2; grep -vxF -f set2 set1

Let's test it:

$ comm -3 <(sort -n A) >(sort -n B) | sed 's/\t//g'
11
12
4
5

$ sort -n A B | uniq -u
4
5
11
12

$ cat <(grep -vxF -f B A) <(grep -vxF -f A B)
5
4
11
12

Power Set

The power set operation generates a power-set of a set. What's a power set? It's a set that contains all subsets of the set. We write P(A) or 2^A to denote all subsets of A. For a set with n elements, the power set contains 2ⁿ elements.

For example, the power-set of the set { a, b, c } contains 2³ = 8 elements. The power-set is { {}, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} }.

It's not easy to do that with simple Unix tools. I could not think of anything better than a silly Perl solution:

$ perl -le '
sub powset {
 return [[]] unless @_;
 my $head = shift;
 my $list = &powset;
 [@$list, map { [$head, @$_] } @$list]
}
chomp(my @e = <>);
for $p (@{powset(@e)}) {
 print @$p;
}' set

Can you think of a way to do it with Unix tools?

Set Cartesian Product

The set Cartesian product operation produces produces a new set that contains all possible pairs of elements from one set and the other. The notation for Cartesian product of sets A and B is A x B.

For example, if set A = { a, b, c } and set B = { 1, 2 } then the Cartesian product A x B = { (a, 1), (a, 2), (b, 1), (b, 2), (c, 1), (c, 2) }.

I can't think of a great solution. I have a very silly solution in bash:

$ while read a; do while read b; do echo "$a, $b"; done < set1; done < set2

Can you think of other solutions?

Disjoint Set Test

The disjoint set test operation finds if two sets are disjoint, i.e., they do not contain common elements.

Two sets are disjoint if their intersection is the empty set. Any of the set intersection commands (mentioned earlier) can be applied on the sets and the output can be tested for emptiness. If it is empty, then the sets are disjoint, if it is not, then the sets are not disjoint.

Another way to test if two sets are disjoint is to use awk:

$ awk '{ if (++seen[$0]==2) exit 1 }' set1 set2
# returns 0 if sets are disjoint
# returns 1 if sets are not disjoint

It works by counting seen elements in set1 and then set2. If any of the elements appear both in set1 and set2, seen count for that element would be 2 and awk would quit with exit code 1.

Empty Set Test

The empty set test tests if the set is empty, i.e., contains no elements. The empty set is usually written as Ø.

It's very easy to test if the set is empty. The cardinality of an empty set is 0:

$ wc -l set | cut -d' ' -f1
# outputs 0 if the set is empty
# outputs > 0 if the set is not empty

Getting rid of cut:

$ wc -l < set
# outputs 0 if the set is empty
# outputs > 0 if the set is not empty

Minimum

The minimum operation returns the smallest number in the set. We write min(A) to denote the minimum operation on the set A.

The minimum element of a set can be found by first sorting it in ascending order and then taking the first element. The first element can be taken with "head" Unix command which outputs the first part of the file:

$ head -1 <(sort set)

The "-1" option specifies to output the first line only.

If the set is already sorted, then it's even simpler:

$ head -1 set

Remember to use "sort -n" command if the set contains numeric data.

Example of running minimum operation on sets A and Abig:

$ head -1 <(sort -n A)
1
$ head -1 <(sort -n Abig)
2798

Maximum

The maximum operation returns the biggest number in the set. We write max(A) to denote the maximum operation on the set A.

The maximum element of a set can be found by first sorting it in ascending order and then taking the last element. The last element can be taken with "tail" Unix command which outputs the last part of the file:

$ tail -1 <(sort set)

The "-1" option specifies to output the last line only.

If the set is already sorted, then it's even simpler:

$ tail -1 set

Remember to use "sort -n" command if the set contains numeric data.

Example of running maximum operation on sets A and Abig:

$ tail -1 <(sort -n A)
5
$ head -1 <(sort -n Abig)
4294906714

Have fun learning all this and see you next time!