set operations

Remember my article on Set Operations in the Unix Shell? I implemented 14 various set operations by using common Unix utilities such as diff, comm, head, tail, grep, wc and others. I decided to create a simpler version of that post that just lists the operations. I also created a .txt cheat-sheet version of it and to make things more interesting I added an Awk implementation of each set op. If you want a detailed explanations of each operation, go to the original article.

Download .txt right away: set operations in unix shell (.txt)
Download path: http://www.catonmat.net/download/setops.txt

Set Membership

$ grep -xc 'element' set    # outputs 1 if element is in set
                            # outputs >1 if set is a multi-set
                            # outputs 0 if element is not in set

$ grep -xq 'element' set    # returns 0 (true)  if element is in set
                            # returns 1 (false) if element is not in set

$ awk '$0 == "element" { s=1; exit } END { exit !s }' set
# returns 0 if element is in set, 1 otherwise.

$ awk -v e='element' '$0 == e { s=1; exit } END { exit !s }'

Set Equality

$ diff -q <(sort set1) <(sort set2) # returns 0 if set1 is equal to set2
                                    # returns 1 if set1 != set2

$ diff -q <(sort set1 | uniq) <(sort set2 | uniq)
# collapses multi-sets into sets and does the same as previous

$ awk '{ if (!($0 in a)) c++; a[$0] } END{ exit !(c==NR/2) }' set1 set2
# returns 0 if set1 == set2
# returns 1 if set1 != set2

$ awk '{ a[$0] } END{ exit !(length(a)==NR/2) }' set1 set2
# same as previous, requires >= gnu awk 3.1.5

Set Cardinality

$ wc -l set | cut -d' ' -f1    # outputs number of elements in set

$ wc -l < set

$ awk 'END { print NR }' set

Subset Test

$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
# outputs something if subset is not a subset of set
# does not putput anything if subset is a subset of set

$ awk 'NR==FNR { a[$0]; next } { if !($0 in a) exit 1 }' set subset
# returns 0 if subset is a subset of set
# returns 1 if subset is not a subset of set

Set Union

$ cat set1 set2     # outputs union of set1 and set2
                    # assumes they are disjoint

$ awk 1 set1 set2   # ditto

$ cat set1 set2 ... setn   # union over n sets

$ cat set1 set2 | sort -u  # same, but assumes they are not disjoint

$ sort set1 set2 | uniq

# sort -u set1 set2

$ awk '!a[$0]++'           # ditto

Set Intersection

$ comm -12 <(sort set1) <(sort set2)  # outputs insersect of set1 and set2

$ grep -xF -f set1 set2

$ sort set1 set2 | uniq -d

$ join <(sort -n A) <(sort -n B)

$ awk 'NR==FNR { a[$0]; next } $0 in a' set1 set2

Set Complement

$ comm -23 <(sort set1) <(sort set2)
# outputs elements in set1 that are not in set2

$ grep -vxF -f set2 set1           # ditto

$ sort set2 set2 set1 | uniq -u    # ditto

$ awk 'NR==FNR { a[$0]; next } !($0 in a)' set2 set1

Set Symmetric Difference

$ comm -3 <(sort set1) <(sort set2) | sed 's/\t//g'
# outputs elements that are in set1 or in set2 but not both

$ comm -3 <(sort set1) <(sort set2) | tr -d '\t'

$ sort set1 set2 | uniq -u

$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)

$ grep -vxF -f set1 set2; grep -vxF -f set2 set1

$ awk 'NR==FNR { a[$0]; next } $0 in a { delete a[$0]; next } 1;
       END { for (b in a) print b }' set1 set2

Power Set

$ p() { [ $# -eq 0 ] && echo || (shift; p "$@") |
        while read r ; do echo -e "$1 $r\n$r"; done }
$ p `cat set`

# no nice awk solution, you are welcome to email me one:
# peter@catonmat.net

Set Cartesian Product

$ while read a; do while read b; do echo "$a, $b"; done < set1; done < set2

$ awk 'NR==FNR { a[$0]; next } { for (i in a) print i, $0 }' set1 set2

Disjoint Set Test

$ comm -12 <(sort set1) <(sort set2)  # does not output anything if disjoint

$ awk '++seen[$0] == 2 { exit 1 }' set1 set2 # returns 0 if disjoint
                                             # returns 1 if not

Empty Set Test

$ wc -l < set            # outputs 0  if the set is empty
                         # outputs >0 if the set is not empty

$ awk '{ exit 1 }' set   # returns 0 if set is empty, 1 otherwise

Minimum

$ head -1 <(sort set)    # outputs the minimum element in the set

$ awk 'NR == 1 { min = $0 } $0 < min { min = $0 } END { print min }'

Maximum

$ tail -1 <(sort set)    # outputs the maximum element in the set

$ awk '$0 > max { max = $0 } END { print max }'

Have Fun!

Have fun with these ops! If you can think of other solutions, or have any tips or tricks to add, please comment on the article! Thank you!

Thanks to waldner and pgas from #awk in FreeNode. And greetings to Andreas for coming up with the cool power set function for bash!

Download "Set Operations in Unix Shell" Document

Download .txt document: set operations in unix shell (.txt)
Downloaded: 10465 times
Download URL: http://www.catonmat.net/download/setops.txt

Download .pdf document: set operations in unix shell (.pdf)
Downloaded: 10897 times
Download URL: http://www.catonmat.net/download/setops.pdf

Comments

Paul Keeble Permalink
December 03, 2008, 14:31

Might it not be a useful command in linux that can actually manipulate sets. Using a similar command structure that of git (git ....)

Then all these commands would already be nicely done and debugged. You could build up a reasonable and simple bash script initially and distribute the appropriate packages.

December 03, 2008, 17:41

Peter, all these operations are very useful, thanks.

// Jadu,
http://unstableme.blogspot.com/search/label/Awk

John Permalink
December 03, 2008, 17:49

Hi Peteris,

Nice article. I enjoy your work.

I noticed a problem with one of your published methods for set intersection

sort A B | uniq -d 

will report items that are duplicated in either set A or set B as intersections. So one has to make sure A and B are uniq'd first which seems to make the command a bit more complicated and it will run a lot slower.

sort <(sort A | uniq) <(sort B | uniq) | uniq -d

Perhaps you might see a way to do this more efficiently.

Also, the join and comm methods may not work reliably on large data sets. I get no intersection on my Abig and Bbig sets. The nifty grep method seems to work the best.

purvainais Permalink
December 04, 2008, 15:12

In document download link: "Set Operatins" -- I think you mean "Set Operations"?

December 04, 2008, 20:15

purvainais, yes, I mean just that. Fixed now. Thanks!

September 15, 2013, 20:45

Zuke’s Z-Bones Natural Edible Clean Carrot Crunch Mini Dog Dental Chews(18 Pack) By:

dhanu Permalink
February 19, 2009, 08:53

Could you please explain what " set -vx " command in a shell script will do?

February 19, 2009, 09:31

dhanu, -x prints commands and their arguments as they are executed:

$ echo hi
hi
$ set -x
$ echo hi
+ echo hi
hi

And -v prints shell input lines as they are read:

$ cat script
#!/bin/bash
#

set -v

. other_script

$ cat other_script
#!/bin/bash
#

echo foo

# ./script

. other_script
#!/bin/bash
#

echo foo
foo
John Yoon Permalink
April 01, 2013, 16:58

'# ./script' -> '$ ./script'

August 04, 2009, 15:26

Very nice.

fyi, here's a slightly different approach: i started writing a commandline tool that does set operations. only intersections, unions, and (assym) differences right now, but it's very useful for the things I need to do. http://gist.github.com/22958

Leave a new comment

(why do I need your e-mail?)

(Your twitter name, if you have one. (I'm @pkrumins, btw.))

Type the first letter of your name: (just to make sure you're a human)

Please preview the comment before submitting to make sure it's OK.

Advertisements