set operations

A while ago I wrote about how I solved the Google Treasure Hunt Puzzle Nr. 4 about prime numbers. I took an unusual approach and solved this problem entirely from the Unix shell. The solution involved finding the intersection between a bunch of files containing numbers. This lead me to an idea to write a post about how to do various set operations from the shell by using common utilities such as sort, uniq, diff, grep, head, tail, comm, and others.

I'll cover the following set operations in this article:

  • Set Membership. Test if an element belongs to a set.
  • Set Equality. Test if two sets contain the same elements.
  • Set Cardinality. Return the number of elements in the set.
  • Subset Test. Test if a given set is a subset of another set.
  • Set Union. Find union of two sets.
  • Set Intersection. Find intersection of two sets.
  • Set Complement. Given two sets A and B, find all elements in A that are not in B.
  • Set Symmetric Difference. Find symmetric difference of two sets.
  • Power Set. Generate all subsets of a set.
  • Set Cartesian Product. Find A x B.
  • Disjoint Set Test. Test if two sets are disjoint.
  • Empty Set Test. Test if a given set is empty.
  • Minimum. Find the smallest element of a set.
  • Maximum. Find the largest element of a set.

Update: I wrote another post about these operations and created a cheat sheet.
Download cheat sheet: set operations in unix shell (.txt)

To illustrate these operations, I created a few random sets to work with. Each set is represented as a file with one element per line. The elements are positive numbers.

First I created two sets A and B with 5 elements each so that I could easily check that the operations really work.

Sets A and B are hand crafted. It's easy to see that only elements 1, 2 and 3 are in common:

$ cat A     $ cat B
3           11
5           1
1           12
2           3
4           2

I also created a set Asub which is a subset of set A and Anotsub which is not a subset of A (to test the Subset Test operation):

$ cat Asub     $ cat Anotsub
3              6
2              7
5              8

Next I created two equal sets Aequal and Bequal again with 5 elements each:

$ cat Aequal     $ cat Bequal
103              100
102              101
101              102
104              103
100              104

Then I created two huge sets Abig and Bbig with 100,000 elements (some of them are repeated, but that's ok).

The easiest way to generate sets Abig and Bbig is to take natural numbers from /dev/urandom. There are two shell commands that can easily do that. The first is "od" and the second is "hexdump".

Here is how to create two files with 100,000 natural numbers with both commands.

With hexdump:

$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Abig
$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Bbig

The "-e" switch specifies a hand-crafted output format. It says take 1 element of size 4 bytes and output it as an unsigned integer. The "-n" switch specifies how many bytes to read, in this case 400000 (400000 bytes / 4 bytes per element = 100000 elements).

With od:

$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Abig
$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Bbig

The "-An" switch specifies that no line address is necessary. The "-w4" switch specifies number of bytes to output per line. The "-tu4" says to output unsigned 4-byte numbers and "-N400000" limits the output to 400000 bytes (400000/4 = 100000 elements). The output from od has to be filtered through sed to drop the leading whitespace characters.

Okay, now let's look at various set operations.

Set Membership

The set membership operation tests if an element belongs to a set. We write aA, if element a belongs to set A, and we write aA, if it does not.

The easiest way to test if an element is in a set is to use "grep" command. Grep searches the file for lines matching a pattern:

$ grep -xc 'element' set

The "-c" flag outputs number of elements in the set. If it is not a multi-set, the number of elements should be 0 or 1. The "-x" option specifies to match the whole line only (no partial matches).

Here is an example of this operation run on set A:

$ grep -xc '4' A
$ grep -xc '999' A

That's correct. Set A contains element 4 but does not contain element 999.

If the membership operation has to be used from a shell script, the return code from grep can be used instead. Unix commands succeed if the return code is 0, and fail otherwise:

$ grep -xq 'element' set
# returns 0 if element ∈ set
# returns 1 if element ∉ set

The "-q" flag makes sure that grep does not output the element if it is in the set.

Set Equality

The set equality operation tests if two sets are the same, i.e., contain the same elements. We write A = B if sets A and B are equal and AB if they are not.

The easiest way to test if two sets are equal is to use "diff" command. Diff command compares two files for differences. It will find that the order of lines differ, so the files have to be sorted first. If they are multi-sets, the output of sort has to be run through "uniq" command to eliminate duplicate elements:

$ diff -q <(sort set1 | uniq) <(sort set2 | uniq)
# returns 0 if set1 = set2
# returns 1 if set1 ≠ set2

The "-q" flag quiets the output of diff command.

Let's test this operation on sets A, B, Aequal and Bequal:

$ diff -q <(sort A | uniq) <(sort B | uniq)
# return code 1 -- sets A and B are not equal

$ diff -q <(sort Aequal | uniq) <(sort Bequal | uniq)
# return code 0 -- sets A and B are equal

If you have already sorted sets, then just run:

$ diff -q set1 set2

Set Cardinality

The set cardinality operations returns the number of elements in the set. We write |A| to denote the cardinality of the set A.

The simplest way to count the number of elements in a set is to use "wc" command. Wc command counts the number of characters, words or lines in a file. Since each element in the set appears on a new line, counting the number of lines in the file will return the cardinality of the set:

$ wc -l set | cut -d' ' -f1

Cut command is necessary because "wc -l" also outputs the name of the file it was ran on. The cut command outputs the first field which is number of lines in the file.

We can actually get rid of cut:

$ wc -l < set

Let's test if on sets A and Abig:

$ wc -l A | cut -d' ' -f1

$ wc -l Abig | cut -d' ' -f1

$ wc -l < A

$ wc -l < Abig 

Subset Test

The subset test tests if the given set is a subset of another set. We write SA if S is a subset of A, and SA, if it's not.

I found a very easy way to do it using the "comm" utility. Comm compares two sorted files line by line. It may be run in such a way that it outputs lines that appear only in the first specified file. If the first file is subset of the second, then all the lines in the 1st file also appear in the 2nd, so no output is produced:

$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
# comm returns no output if subset ⊆ set
# comm outputs something if subset ⊊ set

Please remember that if you have a numeric set, then sort must take "-n" option.

Let's test if Asub is a subset of A:

$ comm -23 <(sort -n Asub|uniq) <(sort -n A|uniq) | head -1
# no output - yes, Asub ⊆ A

Now let's test if Anotsub is a subset of A:

$ comm -23 <(sort -n Anotsub|uniq) <(sort -n A|uniq) | head -1
6 # has output - no, Anotsub ⊊ A

If you want to use it from a shell script, you'd have to test if the output from this command was empty or not.

Set Union

The set union operation unions two sets, i.e., join them into one set. We write C = AB to denote union of sets A and B which produces set C.

Set union is extremely easy to create. Just use the "cat" utility to concatenate two files:

$ cat set1 set2

If the duplicates (elements which are both in set1 and set2) are not welcome, then the output of cat can be filtered via awk:

$ cat set1 set2 | awk '!found[$1]++'

# we can also get rid of cat by just using awk:

$ awk '!found[$1]++' set1 set2

If we don't want to use awk, which is a whole-blown programming language, then we can sort the output of cat and filter it via uniq:

$ cat set1 set2 | sort | uniq

# we can get rid of cat by specifying arguments to sort:

$ sort set1 set2 | uniq

# finally we can get rid of uniq by specifying -u flag to sort

$ sort -u set1 set2

If the sets set1 and set2 are already sorted, then the union operation can be made much faster by specifying the "-m" command line option, which merges the files (like the final step of merge-sort algorithm):

$ sort -m set1 set2 | uniq

# or

$ set -um set1 set2

Let's test this operation on sets A and B:

$ cat A B # with duplicates

$ awk '!found[$1]++' # without dupes

$ sort -n A B | uniq # with sort && uniq

Set Intersection

The set intersection operation finds elements that are in both sets at the same time. We write C = AB to denote the intersection of sets A and B, which produces the set C.

There are many ways to do set intersection. The first way that I am going to show you uses "comm":

$ comm -12 <(sort set1) <(sort set2)

The "-12" option to comm directs it to suppress output of lines appearing just in the 1st and the 2nd file and makes it output lines appearing in both 1st and 2nd, which is the intersection of two sets.

Please remember that if you have a numeric set, then sort must take "-n" option.

Another way to do it is to use "grep" utility. I actually found about this method as I was writing this article:

$ grep -xF -f set1 set2

The "-x" option forces grep to match the whole lines (no partial matches). The "-f set1" specifies the patterns to use for searching. The "-F" option makes grep interpret the given patterns literally (no regexes). It works by matching all lines of set1 in set2. The lines that appear just in set1 or just in set2 are never output.

The next way to find intersection is by using "sort" and "uniq":

$ sort set1 set2 | uniq -d

The "-d" option to uniq forces it to print only the duplicate lines. Obviously, if a line appears in set1 and set2, after sorting there will be two consecutive equal lines in the output. The "uniq -d" command prints such repeated lines (but only 1 copy of it), thus it's the intersection operation.

Just a few minutes before publishing this article I found another way to do intersection with "join" command. Join command joins files on a common field:

$ join <(sort -n A) <(sort -n B)

Here is a test run:

$ sort -n A B | uniq -d

$ grep -xF -f A B

$ comm -12 <(sort -n A) <(sort -n B)

Set Complement

The set complement operation finds elements that are in one set but not the other. We write A - B or A \ B to denote set's B complement in set A.

Comm has become a pretty useful command for operating on sets. It can be applied to implement set complement operation as well:

$ comm -23 <(sort set1) <(sort set2)

The option "-23" specifies that comm should not print elements that appear just in set2 and that are common to both. It leaves comm to print elements which are just in set1 (and not in set2).

The "grep" command can also be used to implement this operation:

$ grep -vxF -f set2 set1

Notice that the order of sets has been reversed from that of comm. That's because we are searching those elements in set1, which are not in set2.

Another way to do it is, of course, with "sort" and "uniq":

$ sort set2 set2 set1 | uniq -u

This is a pretty tricky command. Suppose that a line appears in set1 but does not appear in set2. Then it will be output just once and will not get removed by uniq. All other lines get removed.

Let's put these commands to test:

$ comm -23 <(sort -n A) <(sort -n B)

$ grep -vxF -f B A

$ sort -n B B A | uniq -u

Set Symmetric Difference

The set symmetric difference operation finds elements that are in one set, or in the other but not both. We write A Δ B to denote symmetric difference of sets A and B.

The operation can be implemented very easily with "comm" utility:

$ comm -3 <(sort set1) <(sort set2) | sed 's/\t//g'

# sed can be replaced with tr

$ comm -3 <(sort set1) <(sort set2) | tr -d '\t'

Here comm is instructed via "-3" not to output fields that are common to both files, but to output fields that are just in set1 and just in set2. Sed is necessary because comm outputs two columns of data and some of it is right padded with a \t tab character.

It can also be done with "sort" and "uniq":

$ sort set1 set2 | uniq -u

We can use mathematics and derive a few formulas involving previously used operations for symmetric difference: A Δ B = (A - B) ∪ (B - A). Now we can use grep:

$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)
# does (B - A) ∪ (A - B)

# this can be simplified

$ grep -vxF -f set1 set2; grep -vxF -f set2 set1

Let's test it:

$ comm -3 <(sort -n A) >(sort -n B) | sed 's/\t//g'

$ sort -n A B | uniq -u

$ cat <(grep -vxF -f B A) <(grep -vxF -f A B)

Power Set

The power set operation generates a power-set of a set. What's a power set? It's a set that contains all subsets of the set. We write P(A) or 2A to denote all subsets of A. For a set with n elements, the power set contains 2n elements.

For example, the power-set of the set { a, b, c } contains 23 = 8 elements. The power-set is { {}, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} }.

It's not easy to do that with simple Unix tools. I could not think of anything better than a silly Perl solution:

$ perl -le '
sub powset {
 return [[]] unless @_;
 my $head = shift;
 my $list = &powset;
 [@$list, map { [$head, @$_] } @$list]
chomp(my @e = <>);
for $p (@{powset(@e)}) {
 print @$p;
}' set

Can you think of a way to do it with Unix tools?

Set Cartesian Product

The set Cartesian product operation produces produces a new set that contains all possible pairs of elements from one set and the other. The notation for Cartesian product of sets A and B is A x B.

For example, if set A = { a, b, c } and set B = { 1, 2 } then the Cartesian product A x B = { (a, 1), (a, 2), (b, 1), (b, 2), (c, 1), (c, 2) }.

I can't think of a great solution. I have a very silly solution in bash:

$ while read a; do while read b; do echo "$a, $b"; done < set1; done < set2

Can you think of other solutions?

Disjoint Set Test

The disjoint set test operation finds if two sets are disjoint, i.e., they do not contain common elements.

Two sets are disjoint if their intersection is the empty set. Any of the set intersection commands (mentioned earlier) can be applied on the sets and the output can be tested for emptiness. If it is empty, then the sets are disjoint, if it is not, then the sets are not disjoint.

Another way to test if two sets are disjoint is to use awk:

$ awk '{ if (++seen[$0]==2) exit 1 }' set1 set2
# returns 0 if sets are disjoint
# returns 1 if sets are not disjoint

It works by counting seen elements in set1 and then set2. If any of the elements appear both in set1 and set2, seen count for that element would be 2 and awk would quit with exit code 1.

Empty Set Test

The empty set test tests if the set is empty, i.e., contains no elements. The empty set is usually written as Ø.

It's very easy to test if the set is empty. The cardinality of an empty set is 0:

$ wc -l set | cut -d' ' -f1
# outputs 0 if the set is empty
# outputs > 0 if the set is not empty

Getting rid of cut:

$ wc -l < set
# outputs 0 if the set is empty
# outputs > 0 if the set is not empty


The minimum operation returns the smallest number in the set. We write min(A) to denote the minimum operation on the set A.

The minimum element of a set can be found by first sorting it in ascending order and then taking the first element. The first element can be taken with "head" Unix command which outputs the first part of the file:

$ head -1 <(sort set)

The "-1" option specifies to output the first line only.

If the set is already sorted, then it's even simpler:

$ head -1 set

Remember to use "sort -n" command if the set contains numeric data.

Example of running minimum operation on sets A and Abig:

$ head -1 <(sort -n A)
$ head -1 <(sort -n Abig)


The maximum operation returns the biggest number in the set. We write max(A) to denote the maximum operation on the set A.

The maximum element of a set can be found by first sorting it in ascending order and then taking the last element. The last element can be taken with "tail" Unix command which outputs the last part of the file:

$ tail -1 <(sort set)

The "-1" option specifies to output the last line only.

If the set is already sorted, then it's even simpler:

$ tail -1 set

Remember to use "sort -n" command if the set contains numeric data.

Example of running maximum operation on sets A and Abig:

$ tail -1 <(sort -n A)
$ head -1 <(sort -n Abig)

Have Fun!

Have fun working with these set operations! Thanks to lhunath and waldner from #bash for helping. :)


October 14, 2008, 02:30

Aww, I was disappointed by the power set example. You see, back in the mid 1990's, when we were implementing gcc multilib support (this is the idea that if the user gives gcc flags that lead to incompatible ABIs, you need to supply corresponding libgcc and possibly libc/libm that match...) at Cygnus (mostly for embedded work) I had to implement power set in sh (after all, the user could combine the incompatible options) - and not even full POSIX sh, but a subset that pre-autoconf configure scripts were allowed to use (so that they actually worked on things like LynxOS.)

I mostly remember it being unpleasant, and hoped I'd find something more modern and elegant here :-)

(Eventually Ian Taylor rewrote all that, and I don't know what the current gcc-multilib packaging does - avoiding the problem and specifying the desired combinations explicitly would not have been wrong...)

alpha Permalink
October 14, 2008, 08:16

I mostly remember it being unpleasant, and hoped I’d find something more modern and elegant here :-)

Easiest way to do is to observe that for a set of size N, each element in the power set corresponds to a binary vector of length N. For example, for a set of size 3 with elements {A,B,C}

000 --> {}
001 --> {A}
010 --> {B}
111 --> {A,B,C}

So to enumerate the elements of a power set, just iterate over the 2^N possible values of a binary vector of length N.

Harry Permalink
October 14, 2008, 08:16

Minor typo: The od command example and its description are not consistent in the mention of arguments.

October 14, 2008, 08:29

Nice article!

In the empty test set example, did you mean 'cut' instead of 'set' after the pipe symbol?

October 14, 2008, 08:31

A bash-only powerset could look like:

powerset_overkill() {
        local Set=$*
        if [ "${Set}" = "" ]; then
        echo ${Set}
        for i in ${Set}; do
                local Subset=${Set/${i}/}
                powerset_overkill ${Subset}

powerset() {
        powerset_overkill $* | sort | uniq

time powerset 1 2 3 4 5 6  
October 14, 2008, 09:04

For the power set, you could use shell functions, e.g.

p() { [ $# -eq 0 ] && echo || (shift; p "$@") | while read r ; do echo -e "$1 $r\n$r"; done }


p `cat set`

outputs the power set, e.g.

$ p 1 2 3
1 2 3
2 3
1 3
1 2

Of course, this is limited to max_args of bash, but who needs the power set of >500 elements anyway? ;-)

October 14, 2008, 09:43

Hi Peter. Nice article, which intersects with something I wrote last year about shell scripting, using the set operations example. It's always good to compare notes on how someone else tackles the same problem.

By the way, there's no need to start a pipeline off with cat. Instead of

$ cat set1 set2 | sort -m | uniq

edit: Thomas wrote this example and then noticed that it was not correct. He commented:
Now that I think about it

$ cat set1 set2 | sort -m

is wrong!

Sort -m only works when merging multiple sorted files together: catting the inputs into a single file is a bad idea in this case.

you could just have

$ sort -m set1 set2 | uniq

or even (with gnu sort)

$ sort -mu set1 set2
Tet Permalink
October 14, 2008, 11:50

The -m and -u options to sort are fine anywhere. Those are not unique to GNU sort:

October 14, 2008, 14:43

Thanks for that. I'll analyze it in detail later.
For now here are my methods for set operations:

October 14, 2008, 17:37

Nice stuff. :-) If you're working with really big files, then you'll have problems with diff because it tries to read the whole file into memory. You can use 'cmp' for equality checks instead

October 14, 2008, 23:05

The utility blm is the Boolean Line Manipulator and supports all of the aboved mentioned operations. It is also more than 1000 times faster than the shell and it is much easier to read and write complex operations. The download page for blm is linked in this post and it is auto-installable in Debian and Ubuntu as the blm package.

October 15, 2008, 14:10

It's definitely simpler and possibly more efficient to use `combine` for set operations:

Ralph Corderoy Permalink
October 16, 2008, 17:11

$ set -um set1 set2

Think you mean sort instead of set there.

BTW, I think the normal definition of a set is that an element can't appear in it more than once, and that order doesn't matter. So a lot of your "sets" aren't, in fact.

November 09, 2008, 05:08

Great! Now I just wish I had photographic memory to remember all this :)

December 26, 2008, 10:52

Nice stuff, there - thanks for posting! I kept moving through the page by first trying out to solve by myself before proceeding on to see how you've solved - some exercise to my mind.

caner Permalink
April 02, 2009, 02:44

you should give ipython a try. see youve made yourself comfortable with bash and unix commands but ipython will make things much easier for such tasks. nice post btw!

argv Permalink
July 30, 2009, 01:57

As a physicist who was quite good with awk once said,

"Those who do not understand UNIX are doomed to reimplement it [,poorly]."

It is the temptation to learn higher level languages that preempts coders from learning the UNIX tools. And they just end up reinventing the wheel in perl, ruby, python, javascript, etc.

Perl was written out of laziness, because someone was too lazy to learn awk well enough to do a certain manipulation of log files, and out of frustration he wrote Perl. The physicist quoted above later made him realise it was possible to do the task with awk.

I liked Thomas Guest's article.

If you want to learn something "new", look no further than your base system. It seems (though I cannot confirm) fewer and fewer people are learning the UNIX tools, *thoroughly*. It seems (though I cannot confirm) more and more people see them at a stepping stone.

I refer the reader again to the quote I started off with.

With the UNIX tools, one can do more with less (allocated memory). No task is too big. Divide and conquer. Brute force is always an option, as Ritchie once said.

SSMC Permalink
August 14, 2009, 13:14

It's articles like this one that makes me want to give the www intertubes one big squishy hug! Thank you this article was exactly what I needed...

Paul DeLong Permalink
August 27, 2009, 17:33

Here's a refinement to the cardinality operation:

sort -u set | wc -l

Also, just a minor correction:

head -1 should probably read:
tail -1 (in the last example of the maximum operation)
SGB Permalink
November 24, 2009, 13:55

Amazing stuff.
I needed to do some of these set operations at work in a jiffy and this page helped a lot. It has been a while since I did much stuff in Unix (10 years) and had forgotten most of it and how powerful it can be. I think I should get a book and get back into Unix and unleash the power of the shell. Perhaps this time, I would get around to learning awk.

Thank you!

January 25, 2010, 17:33

AWESOME article; outstanding solutions for just about everything except the power set problem.

Here's an idea for a future article: associative array operations in unix shell

MANY developers jump to perl the moment a problem seems to require associative arrays; I've never understood this; it seems to me that similar strategies to the ones you've employed for sets could be employed for associative arrays

Thanks for your efforts to keep the UNIX way alive and well.

March 18, 2010, 10:23

Set Cartesian Product. Find A x B
|cat N
|cat A

|cat A |while read a ; do
cat N |sed -e "s/./& $a/g"

Oh well, at-least one while is removed. I could have done better with xargs and removed the second while too, but it isn't like find unfortunately in that it doesn't allow me to embed pipelines in the arg

March 18, 2010, 10:31

Oh well, I am cheating :)

|cat A |xargs -I{} -n 1 sh -c 'cat N |xargs -n1 echo {} '

March 18, 2010, 11:12

Advised to run only with smaller sets :)
Not a good solution but just trying to do it with pipelines alone.

cat A |xargs -n 1 ksh -c 'cat A |xargs |sed -e "s; ;,;g" -e "s;.*;{&};g" '|xargs|sed -e 's; ;\\\\\\\ ;g'|xargs -I{} ksh -c 'echo {}' > tmp
cat tmp |xargs -n 3 |cat |xargs -I{} ksh -c 'echo {} |xargs -n 1 |sort -u |xargs ' |sort -u
$cat N
|cat A
$ cat A |xargs -I{} -n 1 sh -c 'cat N |xargs -n1 echo {} '
a 1
a 2
a 3
b 1
b 2
b 3
c 1
c 2
c 3

Power set:

$ cat A |
   xargs -n 1 ksh -c 'cat A |xargs |sed -e "s; ;,;g" -e "s;.*;{&};g" ' |
       xargs |sed -e 's; ;\\\\\\\ ;g'|xargs -I{} ksh -c 'echo {}'|
           xargs -n `wc -l <A` |
               xargs -I{} sh -c 'echo {}|xargs -n 1|sort -u|xargs ' |
                   sort -u


a b
a b c
a c
b c
March 18, 2010, 11:14

Rahul, thanks for these. I never figured out how to do cartesian in an interesting manner in the shell. I actually sent you an email because I was afraid they got broken by this old blog software.

March 18, 2010, 15:19

cartesian product:

$ cat A

$ cat B

$ join -j2 A B
a 1
a 2
b 1
b 2
c 1
c 2

or, if you like

$ join -j2 A B | awk '{print "("$1", "$2")"}'
(a, 1)
(a, 2)
(b, 1)
(b, 2)
(c, 1)
(c, 2)

March 24, 2010, 13:55

I love it. Anyone mentioned using that set expansion for a calculator?

$ echo {1,2,3}{1,2,3,4,5} | wc -w

Also, you can get it to print every number from 1
to 99

echo {,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}

January 08, 2011, 19:33

... but why should one ever remember crypted bash/zsh/ksh/... syntax and it's subtle differences _iff_ script languages like perl, python, _guarantee_ an abstraction level _plus_ simplifying all those tasks _plus_ adding syntactic sugar (in a _per_language_ and not _per_platform_) consistent manner?

i do honor the effort and i do honor being able to accomplish a lot of solutions w/ the most basic tools, _but_ i prefer to be able to move to solaris and not have to reinvent everything again


dunno Permalink
June 28, 2011, 19:44

great article! just a couple of small things:

$ awk '!found[$1]++' # without dupes

(missing input files)

$ cat > Atest

$ cat > Btest

$ sort Atest Btest | uniq -d

will show duplicates within a single multigroup as well as items common to them both.

dunno Permalink
June 28, 2011, 20:40

also there's a typos @ $ set -um set1 set2, $ comm -3 <(sort -n A) >(sort -n B) | sed 's/\t//g'

btw, i would really appreciate a post comparing the efficiency of some of the solutions when there're several provided.

another thing, being the n00b that i am i can't figure out how to force the following being evaluated echo {$(paste -sd, A)}{$(paste -sd, B)} this returns {3,5,1,2,4}{11,1,12,3,2} but if i wrote directly echo {3,5,1,2,4}{11,1,12,3,2} i'd get the desire cross product?

dunno Permalink
June 28, 2011, 20:50

Christopher Suters solution is really nice, didn't know there's a join command, still using a paste command it can be done like this:

eval echo -e {$(paste -sd, A)},{$(paste -sd, B)} | tr ' ' '\n'

could you please fix my precious comment?

Wenslayer Permalink
February 03, 2012, 19:21

Here's my take on power set without resorting to a scripting language:

(echo -e "echo \c"; for set in set1 set2 set3; do echo -e "{`sort -u $set|xargs|tr ' ' ','`,}\c"; done)|bash -|xargs -n1

For example, here are three files representing different sets:

$ head set*
==> set1 <==
==> set2 <==
==> set3 <==

And here's the power set of them:

$ (echo -e "echo \c"; for set in set1 set2 set3; do echo -e "{`sort -u $set|xargs|tr ' ' ','`,}\c"; done) | bash -s | xargs -n1


Wenslayer Permalink
February 03, 2012, 19:25

Whoops, could make it easier by using eval:

$ eval $(echo -e "echo \c"; for set in set1 set2 set3; do echo -e "{`sort -u $set|xargs|tr ' ' ','`,}\c"; done)|xargs -n1
June 06, 2012, 05:51

It took me a bit of hunting to figure out that <(list) creates a named pipe.

DanL Permalink
January 18, 2013, 01:09

The only problem with producing large sets with the methods outlined above is that there may be duplicates in sets. This messed with the complement operations when a duplicate is in set1: sort set2 set2 set1|uniq -u doesn't show any duplicates, comm outputs a de-duped set, while grep leaves the duplicates in set1 unmolested.

I guess the lesson is: make sure your sets are true sets first.

Henning Permalink
April 07, 2013, 17:27

Power set:

$ cat set
$ eval echo {$(cat set | xargs -I{} echo -n \{{},\})}
{abc} {ab} {ac} {a} {bc} {b} {c} {}

or slightly shorter:

$ eval echo {$(cat set | xargs -I# echo -n \{#,\})}

I've got the idea when I saw the wonderful combinatorical feature mentioned under #6 in this article. First I found (after some tries) this which would give the desired result:

$ echo {{,a}{,b}{,c}}
{} {c} {b} {bc} {a} {ac} {ab} {abc}

(The outmost braces are nessesary for the empty set.) But here we manually type in the members of the set. To build the expression from the content of the file "set" I went the following steps:

$ cat set |xargs -I{} echo \{{},\}

We need it in one line, so we append -n to echo:

cat set |xargs -I{} echo -n \{{},\}

(We dont get a newline after that!)
To surround this with curly brackets we use command substition and another echo:

$ echo {$(cat set |xargs -I{} echo -n \{{},\})}

With eval we ask the bash do brace expansion again:

$ eval echo {$(cat set |xargs -I{} echo -n \{{},\})}
{abc} {ab} {ac} {a} {bc} {b} {c} {}
Brett Estrade Permalink
April 24, 2013, 16:01

I think it should be stressed, if not already, that "uniq" is sensitive to whitespace. It caused me some issues until I s/ *//g all white space in each line using sed before the uniq.

December 07, 2013, 19:30

Bash Solution to union:

# Usage: ./ set1 set2
declare -Ai union

while read -re element; do
done < <(cat "${@}")

declare -p union

January 27, 2015, 00:32

Over the weekend I wrote a program that I call setdown that does many set operations from files. I would really recommend that you check it out. It is written in Haskell so it is very fast and efficient:

You can install it just by running: cabal install setdown

September 07, 2015, 14:08

Thank you for your article! It just helped me a lot in figuring out which files would be affected by a merge.

Did you estimate the asymptotic runtime of your solutions? I ask because that’s one of the big strength of set operations in higher level languages: convert a list to a set and you get O(1) access time.

September 29, 2015, 05:30

Power Set of S={a,b,c}

SET=$(echo \{\,{a,b,c}\});eval echo\ {${SET// /}}

September 29, 2015, 23:08

This works in zsh but not in bash cause of the nested parameter expansion.

S=$(eval echo\ "\{\,-$1-\}")
P=$(eval echo\ "{${S// /}}")
echo {${${P//(--| )/,}//-/}}

Can be used like this: SET='{a,b,c}'; ./ $SET
It outputs: {{},{c},{b},{b,c},{a},{a,c},{a,b},{a,b,c}}

P.S. How can i delete a comment?

September 29, 2015, 23:56

Cartesian Product

join -1 2 -2 2 <(echo ${${1//,/\\n}//({|})/}) <(echo ${${2//,/\\n}//({|})/})|gsed 's/^ /(/;s/ /,/;s/$/)/;s/\n/,/'|gsed -z 's/\n/,/g;s/^/{/;s/,$/}\n/'<\pre>
December 27, 2016, 22:50

Here's a nice way to achieve the Cartesian join using only tools already present in your article, but without any general purpose languages such as awk.

join -1 1 -2 1 -o 1.2,2.2 \
<(sort -u setA.txt | sed 's/^/x /') \
<(sort -u setB.txt | sed 's/^/x /')

Leave a new comment

(why do I need your e-mail?)

(Your twitter handle, if you have one.)

Type the word "0day_97": (just to make sure you're a human)

Please preview the comment before submitting to make sure it's OK.