This article is part of the article series "MIT Introduction to Algorithms."
<- previous article next article ->

MIT Algorithms

This is the eleventh post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lecture sixteen, which introduces the concept of Greedy Algorithms, reviews Graphs and applies the greedy Prim's Algorithm to the Minimum Spanning Tree (MST) Problem.

The previous lecture introduced dynamic programming. Dynamic programming was used for finding solutions to optimization problems. In such problems there can be many possible solutions. Each solution has a value, and we wish to find a solution with the optimal (minimum or maximum) value. Greedy algorithms are another set of methods for finding optimal solutions. A greedy algorithm always makes the choice that looks best at the moment. That is, it makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution. Greedy algorithms do not always yield optimal solutions, but for many problems they do. In this lecture it is shown that a greedy algorithm gives an optimal solution to the MST problem.

Lecture sixteen starts with a review of graphs. It reminds us that a graph G is a pair (V, E), where V is a set of vertices (nodes) and E ⊆ V×V is a set of edges. If E contains ordered pairs, the graph is directed (also called a digraph); otherwise it's undirected. It also gives facts such as: for any graph, |E| = O(|V|²), and if a graph is connected, then |E| >= |V| - 1. It also mentions the handshaking lemma - the sum of the degrees of all vertices equals twice the number of edges: Σv∈V degree(v) = 2|E|.

The lecture continues with graph representations. Graphs are commonly stored as adjacency matrices or as adjacency lists. The adjacency matrix of a graph G = (V, E), where V = {1, 2, ..., n}, is the n×n matrix A with elements aij = 1 if edge (i, j) ∈ E, and aij = 0 if edge (i, j) ∉ E.

Here is an example of the adjacency matrix of a digraph G = ({1, 2, 3, 4}, {(1, 2), (1, 3), (2, 3), (4, 3)}):

Adjacency matrix example

You can see that the element aij of matrix A is 1 if there is an edge from i to j, and 0 if there is no edge from i to j (i is the row, j is the column). The storage space required by this representation is O(|V|²). For dense graphs this is fine, but for sparse graphs it wastes a lot of space.
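Here is how that might look in code - a minimal Python sketch of my own (not from the lecture) that builds the adjacency matrix for the example digraph above:

# Build the adjacency matrix of the example digraph.
# Vertices are numbered 1..4; E holds the ordered edge pairs.
V = [1, 2, 3, 4]
E = [(1, 2), (1, 3), (2, 3), (4, 3)]

n = len(V)
A = [[0] * n for _ in range(n)]   # n x n matrix of zeros
for (i, j) in E:
    A[i - 1][j - 1] = 1           # row i, column j (vertices are 1-indexed)

for row in A:
    print(row)
# [0, 1, 1, 0]
# [0, 0, 1, 0]
# [0, 0, 0, 0]
# [0, 0, 1, 0]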

The adjacency list of a given vertex v ∈ V is the list Adj[v] of vertices adjacent to v. Here is an example of the adjacency list for the same graph:

Adjacency list example

Storage required by this representation is O(V+E).
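In code, the adjacency list representation is just a mapping from each vertex to the list of its neighbors. Here is a minimal Python sketch of my own for the same example digraph:

from collections import defaultdict

E = [(1, 2), (1, 3), (2, 3), (4, 3)]

Adj = defaultdict(list)   # Adj[v] is the list of vertices adjacent to v
for (u, v) in E:
    Adj[u].append(v)

for u in sorted(Adj):
    print(u, '->', Adj[u])
# 1 -> [2, 3]
# 2 -> [3]
# 4 -> [3]

Only the edges that actually exist get stored, which is where the O(V + E) space bound comes from.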

The lecture progresses with the problem of minimum spanning trees (MSTs). It's stated as follows: given a connected, undirected graph G = (V, E) with an edge weight function w: E -> R, find a spanning tree that connects all the vertices and has minimum total edge weight.

Here is an example of an MST. The MST of the graph is colored in green. The MST connects all the vertices of the graph so that the total weight of the edges is minimized.

Minimum Spanning Tree

Analyzing the MST problem in more detail, it appears that the MST problem has the optimal substructure property - if we take a subgraph and look at its MST, that MST will be a part of the MST of the whole graph. This is dynamic programming hallmark #1 (see the previous lecture). It also appears that the problem of finding an MST has overlapping subproblems, which is dynamic programming hallmark #2. Hallmarks #1 and #2 suggest that we could use a dynamic programming algorithm to find an MST of a graph. Surprisingly, there is something more powerful than a dynamic programming algorithm for this problem - a greedy algorithm.

There is a hallmark for greedy algorithms:

  • Greedy choice property - a locally optimal choice is globally optimal.

The lecture continues with a theorem and a proof that a greedy local choice for an MST is a globally optimal choice. This is the key idea behind Prim's algorithm for finding a minimum spanning tree of a graph. The idea of the algorithm is to maintain a priority queue Q (keyed on the minimum weight of an edge connecting a vertex to the growing tree) of vertices that are not yet in the MST. At each step the algorithm extracts the minimum-key vertex from Q, adds the corresponding minimum-weight edge to the MST, and updates the keys of the extracted vertex's neighbors (please see the lecture or lecture notes for the exact algorithm and an example).
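To make the idea concrete, here is a short Python sketch of Prim's algorithm. This is my own illustration, not the lecture's pseudocode: it uses Python's heapq module as the priority queue, with lazy deletion of stale entries instead of the decrease-key operation described in the lecture.

import heapq

def prim_mst(graph, start):
    # graph: dict mapping a vertex to a list of (weight, neighbor) pairs.
    # Assumes the graph is connected and undirected (every edge appears
    # in both adjacency lists). Returns the list of MST edges.
    visited = {start}
    mst_edges = []
    # Priority queue of candidate edges (weight, u, v), keyed on weight.
    pq = [(w, start, v) for (w, v) in graph[start]]
    heapq.heapify(pq)
    while pq:
        w, u, v = heapq.heappop(pq)
        if v in visited:
            continue  # stale entry; a cheaper edge already reached v
        visited.add(v)
        mst_edges.append((u, v, w))
        for (w2, x) in graph[v]:
            if x not in visited:
                heapq.heappush(pq, (w2, v, x))
    return mst_edges

# A small weighted undirected graph as an example.
g = {
    'a': [(4, 'b'), (8, 'c')],
    'b': [(4, 'a'), (2, 'c'), (5, 'd')],
    'c': [(8, 'a'), (2, 'b'), (9, 'd')],
    'd': [(5, 'b'), (9, 'c')],
}
print(prim_mst(g, 'a'))  # [('a', 'b', 4), ('b', 'c', 2), ('b', 'd', 5)]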

The lecture ends with a running time analysis of Prim's algorithm. As it uses a priority queue, the running time depends on the implementation of the priority queue used. For example, if we use a dumb array as the priority queue, the running time is O(V²); if we use a binary heap, the running time is O(E·lg(V)); and if we use a Fibonacci heap, the running time is O(E + V·lg(V)).

Another well-known algorithm for MSTs is Kruskal's algorithm (it runs in O(E·lg(V))), which uses a disjoint-set data structure that is not covered in this course (but is covered in Chapter 21 of the book this course is taught from).
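For comparison, here is a minimal Python sketch of Kruskal's algorithm with a bare-bones disjoint-set (union-find) structure. Again, this is my own illustration, not code from the course:

def kruskal_mst(vertices, edges):
    # edges: list of (weight, u, v) triples of an undirected graph.
    # Returns the list of MST edges.
    parent = {v: v for v in vertices}

    def find(x):
        # Find the root of x's set, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):   # consider edges by increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                # u and v are in different trees
            parent[ru] = rv         # union the two trees
            mst.append((u, v, w))
    return mst

edges = [(4, 'a', 'b'), (8, 'a', 'c'), (2, 'b', 'c'),
         (5, 'b', 'd'), (9, 'c', 'd')]
print(kruskal_mst(['a', 'b', 'c', 'd'], edges))
# [('b', 'c', 2), ('a', 'b', 4), ('b', 'd', 5)]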

The best known algorithm to date for finding MSTs is by David R. Karger (MIT), Philip N. Klein (Brown University) and Robert E. Tarjan (Princeton University). The algorithm was found in 1993; it's randomized in nature and runs in O(E + V) expected time. Here is the publication that presents this algorithm: "A Randomized Linear-Time Algorithm for Finding Minimum Spanning Trees".

You're welcome to watch lecture sixteen:

Topics covered in lecture sixteen:

  • [00:10] Review of graphs.
  • [00:50] Digraph - directed graph.
  • [01:50] Undirected graph.
  • [02:40] Bound on the number of edges in a graph |E| = O(V²).
  • [04:10] Connected graph edges |E| >= |V| - 1.
  • [05:50] Graph representations.
  • [06:15] Adjacency matrix graph representation.
  • [08:20] Example of adjacency matrix of a graph.
  • [09:50] Storage required for an adjacency matrix O(V²). Dense representation.
  • [12:40] Adjacency list graph representation.
  • [13:40] Example of adjacency list of a graph.
  • [15:50] Handshaking lemma (theorem) for undirected graphs.
  • [18:20] Storage required for adj. list representation O(E + V).
  • [23:30] Minimum spanning trees.
  • [24:20] Definition of MST problem.
  • [28:45] Example of a graph and its minimum spanning tree.
  • [34:50] Optimal substructure property of MST (dynamic programming hallmark #1).
  • [38:10] Theorem that a subgraph's MST is part of whole graph's MST.
  • [42:40] Cut and paste argument proof.
  • [45:40] Overlapping subproblems of MST (dynamic programming hallmark #2).
  • [48:20] Hallmark for greedy algorithms: The greedy choice property - a locally optimal choice is globally optimal.
  • [50:25] Theorem and proof that a local choice of least weight edge is globally optimal for MSTs.
  • [01:01:10] Prim's algorithm for minimum spanning trees.
  • [01:01:50] Key idea of Prim's algorithm.
  • [01:02:55] Pseudocode of Prim's algorithm.
  • [01:07:40] Example of running Prim's algorithm.
  • [01:15:03] Analysis of Prim's algorithm running time.
  • [01:18:20] Running time if we choose an array, binary heap or Fibonacci heap.
  • [01:22:40] Kruskal's algorithm.
  • [01:23:00] The best algorithm for MST to date.

Lecture sixteen notes:

MIT Algorithms Lecture 16 Notes Thumbnail. Page 1 of 2.
Lecture 16, page 1 of 2: graphs, graph representations, adjacency lists, minimum spanning trees, cut and paste argument.

MIT Algorithms Lecture 16 Notes Thumbnail. Page 2 of 2.
Lecture 16, page 2 of 2: greedy choice property, Prim's algorithm, idea of Kruskal's algorithm.

Have fun with greedy programming! The next posts will form a trilogy on graph and shortest-paths algorithms - Dijkstra's algorithm, breadth-first search, the Bellman-Ford algorithm, the Floyd-Warshall algorithm and Johnson's algorithm!

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms"). Chapter 16 is called "Greedy Algorithms". It explains the activity-selection problem, elements of greedy algorithms, Huffman codes, matroid theory, and a task-scheduling problem. Graphs are reviewed in Appendix B of the book. If these topics excite you, you may want to buy this awesome book:

What do you think John McCarthy would say if he saw your code? I don't think he'd like it...

John McCarthy - You’re Programming Completely Wrong
John McCarthy - Programming, You Are Doing It Completely Wrong.

See also difference between Larry Wall and Edsger Dijkstra.

set operations

Remember my article on Set Operations in the Unix Shell? I implemented 14 different set operations using common Unix utilities such as diff, comm, head, tail, grep, wc and others. I decided to create a simpler version of that post that just lists the operations. I also created a .txt cheat-sheet version of it, and to make things more interesting, I added an Awk implementation of each set operation. If you want a detailed explanation of each operation, go to the original article.

Download .txt right away: set operations in unix shell (.txt)
Download path: http://www.catonmat.net/download/setops.txt

Set Membership

$ grep -xc 'element' set    # outputs 1 if element is in set
                            # outputs >1 if set is a multi-set
                            # outputs 0 if element is not in set

$ grep -xq 'element' set    # returns 0 (true)  if element is in set
                            # returns 1 (false) if element is not in set

$ awk '$0 == "element" { s=1; exit } END { exit !s }' set
# returns 0 if element is in set, 1 otherwise.

$ awk -v e='element' '$0 == e { s=1; exit } END { exit !s }' set

Set Equality

$ diff -q <(sort set1) <(sort set2) # returns 0 if set1 is equal to set2
                                    # returns 1 if set1 != set2

$ diff -q <(sort set1 | uniq) <(sort set2 | uniq)
# collapses multi-sets into sets and does the same as previous

$ awk '{ if (!($0 in a)) c++; a[$0] } END{ exit !(c==NR/2) }' set1 set2
# returns 0 if set1 == set2
# returns 1 if set1 != set2

$ awk '{ a[$0] } END{ exit !(length(a)==NR/2) }' set1 set2
# same as previous, requires >= gnu awk 3.1.5

Set Cardinality

$ wc -l set | cut -d' ' -f1    # outputs number of elements in set

$ wc -l < set

$ awk 'END { print NR }' set

Subset Test

$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
# outputs something if subset is not a subset of set
# does not output anything if subset is a subset of set

$ awk 'NR==FNR { a[$0]; next } { if (!($0 in a)) exit 1 }' set subset
# returns 0 if subset is a subset of set
# returns 1 if subset is not a subset of set

Set Union

$ cat set1 set2     # outputs union of set1 and set2
                    # assumes they are disjoint

$ awk 1 set1 set2   # ditto

$ cat set1 set2 ... setn   # union over n sets

$ cat set1 set2 | sort -u  # same, but assumes they are not disjoint

$ sort set1 set2 | uniq

$ sort -u set1 set2        # ditto

$ awk '!a[$0]++' set1 set2 # ditto

Set Intersection

$ comm -12 <(sort set1) <(sort set2)  # outputs intersection of set1 and set2

$ grep -xF -f set1 set2

$ sort set1 set2 | uniq -d

$ join <(sort set1) <(sort set2)

$ awk 'NR==FNR { a[$0]; next } $0 in a' set1 set2

Set Complement

$ comm -23 <(sort set1) <(sort set2)
# outputs elements in set1 that are not in set2

$ grep -vxF -f set2 set1           # ditto

$ sort set2 set2 set1 | uniq -u    # ditto

$ awk 'NR==FNR { a[$0]; next } !($0 in a)' set2 set1

Set Symmetric Difference

$ comm -3 <(sort set1) <(sort set2) | sed 's/\t//g'
# outputs elements that are in set1 or in set2 but not both

$ comm -3 <(sort set1) <(sort set2) | tr -d '\t'

$ sort set1 set2 | uniq -u

$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)

$ grep -vxF -f set1 set2; grep -vxF -f set2 set1

$ awk 'NR==FNR { a[$0]; next } $0 in a { delete a[$0]; next } 1;
       END { for (b in a) print b }' set1 set2

Power Set

$ p() { [ $# -eq 0 ] && echo || (shift; p "$@") |
        while read r ; do echo -e "$1 $r\n$r"; done }
$ p `cat set`

# no nice awk solution, you are welcome to email me one:
# peter@catonmat.net

Set Cartesian Product

$ while read a; do while read b; do echo "$a, $b"; done < set1; done < set2

$ awk 'NR==FNR { a[$0]; next } { for (i in a) print i, $0 }' set1 set2

Disjoint Set Test

$ comm -12 <(sort set1) <(sort set2)  # does not output anything if disjoint

$ awk '++seen[$0] == 2 { exit 1 }' set1 set2 # returns 0 if disjoint
                                             # returns 1 if not

Empty Set Test

$ wc -l < set            # outputs 0  if the set is empty
                         # outputs >0 if the set is not empty

$ awk '{ exit 1 }' set   # returns 0 if set is empty, 1 otherwise

Minimum

$ head -1 <(sort set)    # outputs the minimum element in the set

$ awk 'NR == 1 { min = $0 } $0 < min { min = $0 } END { print min }' set

Maximum

$ tail -1 <(sort set)    # outputs the maximum element in the set

$ awk 'NR == 1 { max = $0 } $0 > max { max = $0 } END { print max }' set

Have Fun!

Have fun with these ops! If you can think of other solutions, or have any tips or tricks to add, please comment on the article! Thank you!

Thanks to waldner and pgas from #awk in FreeNode. And greetings to Andreas for coming up with the cool power set function for bash!

Download "Set Operations in Unix Shell" Document

Download .txt document: set operations in unix shell (.txt)
Downloaded: 10899 times
Download URL: http://www.catonmat.net/download/setops.txt

Download .pdf document: set operations in unix shell (.pdf)
Downloaded: 11637 times
Download URL: http://www.catonmat.net/download/setops.pdf

This article is part of the article series "Musical Geek Friday."
<- previous article next article ->

Don't Copy That Floppy

This week on Musical Geek Friday, an anti-piracy video musical from the 1990s! It's called Don't Copy That Floppy.

"Don't Copy That Floppy" was an anti-copyright infringement campaign run by the Software Publishers Association (SPA) beginning in 1992.

In this video two teenagers, Jenny and Corey, are playing a game on a classroom computer. Corey is exuberantly pushing keys and is heavily immersed in the game action; Jenny is beating him. Frustrated, he asks for a rematch, but she has an upcoming class and must leave. He decides he will copy the game so that he can play it at home. Upon inserting his blank floppy disk into the Apple Macintosh computer, a video pops up on the screen. The video features a rapper named MC Double Def DP, the "Disk Protector". The DP's role is instructional: he must teach the teenagers that copying games is bad. His method of lecture is a hip-hop style song and dance.

I ripped and edited the audio from the video. You can listen and download it here:

[audio:http://www.catonmat.net/download/dont_copy_that_floppy.mp3]

Download this song: don't copy that floppy (musical geek friday #16)
Downloaded: 22624 times

Download lyrics: don't copy that floppy lyrics (musical geek friday #16)
Downloaded: 5451 times

Here is the whole video; some parts are really boring and you may want to skip them:

Here is just the song that I ripped:

Don't Copy That Floppy lyrics:

(Corey: Jenny, hold up. Look, I brought a disk and we could copy this, ok?
We could play it on my brother's computer.
Jenny: Ok, no problem... All we gotta do is... Woah!
Corey: Are you sure you know whatcha doing?)

Did I hear you right, did I hear you sayin'
That you're gonna make a copy of a game without payin'?
Come on, guys, I thought you knew better don't copy that floppy!

Don't don't don't don't...

(Corey: Wait a minute. Who the heck are you, anyway?
Jenny: Yeah. And what are you doing on my computer?)

I'm your MC Double Def DP
That's the Disk Protector for you and the posse
That's your artists, writers, designers and programmers
They pump up the images for games and gramma's that lets you learn, but also play
The games you came here for today

Now I know you love the game and that's alright to do
Because the posse who make them, they love them too
But if you start stealing, there's no more they can do

(Corey: But I just wanted to make one copy!)

You say 'I'll just make a copy, for me and a friend'
Then he'll make one and she'll make one and where will it end?
One leads to another then ten, then more,
And no one buys anything from the store
So no one gets paid and they can't make more
The posse breaks up and they close the door
Don't copy! Don't copy that floppy!

So let me break this down for you:
D-D-Do-Do-Don't

No Carmen Sandiego, no more Oregon Trail
Tetris and the others, they're all gonna fail
Not because we want it but because you're just takin' it
Dis-res-pec-tin' all the folks who are ma-kin' it
The more you take, the less there will be
The disks become fewer, the games fall away
The screen starts to tweak, and then it will fade
Programs fall through a black hole in space
The computer world becomes bleak and stark
Loses its life and the screen goes dark

(Computer: Welcome to the end of the computer age... mwahahahaha...)

But I'm much too strong and you're much too smart
To let that happen to your chances to explore
Parts of the new age just behind the door of your minds
You're the posse of the future and you hold in your brains what's never thought before
And in time, you'll see just so much more
That's why I'm here and that's what I'm fighting for
Don't copy! Don't copy that floppy!

Now let me introduce you, to some of the teams
That will explain a little more about what I mean!

D-D-Do-Do-Don't... Don't copy that floppy!

You see, on these disks we have frozen in time
The creativity of someone's mind
Do you think, that because, with a flick of a key
You can copy that game, that the work is free
This creativity, we protect it by law
We value so highly, what the mind's eye saw
Don't copy! Don't copy that floppy!

D-D-Do-Do-Don't... Don't copy... Don't copy that floppy!

To do the right thing, it's really simple for you
The copyright law, it will tell you what to do
Buy one, for every computer you use
Anything else is like going to the store
Taking the disk, and walking out the door
It's called thiefin', stealin', taking what's not yours
Is that really where you want your life to go?
Think about it, I don't think so.
Don't copy! Don't copy that floppy!

Now you see a game you like and you really want to try it
Don't copy that floppy, just go to the store and buy it
Think of it this way, okay?
When you buy a disk, you're sayin' to the team
You respect what you do and what you're workin' for
We'll keep up our support so you can make up some more
We'll do the right thing and the future will be clear
There will be new programs here at the end
Don't copy! Don't copy that floppy!

Now you know how the games and the programs are made
And what you do to make sure that they're not gonna fade
The bottom line is it's all up to you
There's nothing more that I can do
The ball's in your court, dribble, shoot, or pass
I'm sure you'll make your decision with class

Don't copy that floppy

(MC Double Def DP: See ya, I'm outta here!)

Download "Don't Copy That Floppy" Song

Download this song: don't copy that floppy (musical geek friday #16)
Downloaded: 22624 times

Download lyrics: don't copy that floppy lyrics (musical geek friday #16)
Downloaded: 5451 times

Click to listen:
[audio:http://www.catonmat.net/download/dont_copy_that_floppy.mp3]

Have fun and until next geeky Friday! :)

This article is part of the article series "MIT Introduction to Algorithms."
<- previous article next article ->

MIT Algorithms

This is the tenth post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lecture fifteen, which introduces the concept of Dynamic Programming and applies it to the Longest Common Subsequence problem.

Dynamic programming is a design technique similar to divide and conquer. Divide-and-conquer algorithms partition the problem into independent subproblems, solve the subproblems recursively, and then combine their solutions to solve the original problem. Dynamic programming is applicable when the subproblems are not independent, that is, when subproblems share subsubproblems. A dynamic-programming algorithm solves every subsubproblem just once and then saves its answer in a table, thereby avoiding the work of recomputing the answer every time the subsubproblem is encountered.

Dynamic programming was systematized by Richard E. Bellman, who began the systematic study of it in 1955. The word "programming," both here and in linear programming, refers to the use of a tabular solution method and not to writing computer code.

Dynamic programming is typically applied to optimization problems. In such problems there can be many possible solutions. Each solution has a value, and we wish to find a solution with the optimal (minimum or maximum) value. We call such a solution an optimal solution, as opposed to the optimal solution, since there may be several solutions that achieve the optimal value.

Dynamic programming can be effectively applied to solve the longest common subsequence (LCS) problem. The problem is stated as follows: given two sequences (or strings) x and y, find a maximum-length common subsequence of x and y. (Note that a subsequence, unlike a substring, need not be contiguous; it only has to preserve the left-to-right order of the symbols.)

For example, given two sequences x = "ABCBDAB" and y = "BDCABA", the LCS(x, y) = { "BCBA", "BDAB", "BCAB" }. As you can see, there are several optimal solutions.

Lecture fifteen introduces dynamic programming via this longest common subsequence problem. It first gives a brute-force, exponential-time algorithm for solving it. The idea of the algorithm is to check every subsequence of x[1..m] (m is the length of sequence x) to see if it is also a subsequence of y[1..n] (n is the length of sequence y). Checking takes O(n) time, and there are 2^m subsequences of x. The running time thus is exponential: O(n·2^m). That is no good for large sequences, so the lecture continues with a simplification - let's look first at the length of a longest common subsequence and then extend the algorithm to find the LCS itself. The simplified algorithm is recursive in nature and computes the same subproblems over and over. At this moment two dynamic programming hallmarks are stated:

  • 1. Optimal substructure: an optimal solution to a problem contains optimal solutions to subproblems.
  • 2. Overlapping subproblems: a recursive solution contains a "small" number of distinct subproblems repeated many times.

As the subproblems overlap, the lecture introduces the concept of a memoization algorithm (note that it's memoization, not memorization). A better-known word for memoization is caching. The subproblems are memoized (cached) so that they are not recomputed over and over again.
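The recurrence behind the recursive algorithm (this is the lecture's theorem about LCS length) is: c[i, j] = c[i-1, j-1] + 1 if x[i] = y[j], and c[i, j] = max(c[i-1, j], c[i, j-1]) otherwise, with c[i, 0] = c[0, j] = 0. Here is a minimal memoized Python sketch of my own (not the lecture's pseudocode) that implements it:

from functools import lru_cache

def lcs_length(x, y):
    # Top-down LCS length with memoization: each subproblem (i, j)
    # is computed once and then served from the cache.
    @lru_cache(maxsize=None)
    def c(i, j):
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return c(i - 1, j - 1) + 1
        return max(c(i - 1, j), c(i, j - 1))
    return c(len(x), len(y))

print(lcs_length("ABCBDAB", "BDCABA"))  # prints 4

There are only (m+1)·(n+1) distinct subproblems, so the memoized algorithm runs in O(m·n) time and space instead of exponential time.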

The lecture ends with constructing a dynamic programming table for the LCS problem and explaining how to find an LCS from this table.
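Here is what the bottom-up version might look like - again a sketch of my own, not the lecture's pseudocode. It fills the (m+1)×(n+1) table row by row and then walks back through the table to recover one actual LCS:

def lcs(x, y):
    m, n = len(x), len(y)
    # c[i][j] is the LCS length of the prefixes x[:i] and y[:j].
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    # Walk back from c[m][n] to reconstruct one optimal solution.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(out))

print(lcs("ABCBDAB", "BDCABA"))  # prints "BCBA", one of the optimal solutions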

You're welcome to watch lecture fifteen:

Topics covered in lecture fifteen:

  • [00:20] Dynamic programming.
  • [01:47] Longest common subsequence (LCS) problem.
  • [03:55] Example of LCS on sequences "ABCBDAB" and "BDCABA".
  • [06:55] Brute force algorithm for LCS.
  • [07:50] Analysis of brute force algorithm.
  • [11:40] Simplification of LCS problem.
  • [16:20] Theorem about LCS length.
  • [18:25] Proof of the theorem.
  • [30:40] Dynamic programming hallmark #1: Optimal substructure.
  • [32:25] Example of hallmark #1 on LCS.
  • [34:15] Recursive algorithm for longest common subseq.
  • [36:40] Worst case analysis of the algorithm.
  • [38:10] Recursion tree of algorithm.
  • [42:40] Dynamic programming hallmark #2: Overlapping subproblems.
  • [44:40] Example of hallmark #2 on LCS.
  • [45:50] Memoization algorithm for LCS.
  • [48:45] Time and space analysis of memoized algorithm.
  • [54:30] Dynamic programming algorithm for LCS.
  • [01:01:15] Analysis of dynamic programming algorithm.

Lecture fifteen notes:

MIT Algorithms Lecture 15 Notes Thumbnail. Page 1 of 2.
Lecture 15, page 1 of 2.

MIT Algorithms Lecture 15 Notes Thumbnail. Page 2 of 2.
Lecture 15, page 2 of 2.

Have fun with programming dynamically! The next post will be about graphs, greedy algorithms and minimum spanning trees.

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms"). Chapter 15 is called "Dynamic Programming" and covers the topics in this lecture. It also explains the assembly-line scheduling problem, the matrix-chain multiplication problem, elements of dynamic programming, and optimal binary search trees.