**93**Comments December 13, 2008

# Famous Awk One-Liners Explained, Part II: Text Conversion and Substitution

This is the second part of a three-part article on the **famous Awk one-liners**. This part will explain Awk one-liners for **text conversion and substitution**. See part one for introduction of the series.

"What are these famous Awk one-liners?", you might wonder? Well, they are concise and beautiful Awk programs that span no more than 70 characters (less than one terminal line). They were written by Eric Pement and are floating around on the Internet as '**awk1line.txt**' file.

If you are intrigued by this article series, I suggest that you subscribe to my posts!

Eric Pement's Awk one-liner collection consists of five sections:

- 1. File spacing (explained in part one).
- 2. Numbering and calculations (explained in part one).
**3. Text conversion and substitution (explained in this part).**- 4. Selective printing of certain lines (explained in part three).
- 5. Selective deleting of certain lines (explained in part three).
- 6. String creation, array creation and update on selective printing of certain lines (explained in bonus article).
- 7. Release of Awk One-Liners Explained e-book.

I recommend that you print out my Awk Cheat Sheet before you proceed. This way you will have the language reference in front of you, and you will memorize it better.

**Awesome news:** I have written an e-book based on this article series. Check it out:

Grab the local copy of Awk one-liners file here awk1line.txt and let's roll.

## 3. Text Conversion and Substitution

**21. Convert Windows/DOS newlines (CRLF) to Unix newlines (LF) from Unix.**

awk '{ sub(/\r$/,""); print }'

This one-liner uses the sub(regex, repl, [string]) function. This function substitutes the first instance of regular expression "regex" in string "string" with the string "repl". If "string" is omitted, variable $0 is used. Variable $0, as I explained in the first part of the article, contains the entire line.

The one-liner replaces '\r' (CR) character at the end of the line with nothing, i.e., erases CR at the end. Print statement prints out the line and appends ORS variable, which is '\n' by default. Thus, a line ending with CRLF has been converted to a line ending with LF.

**22. Convert Unix newlines (LF) to Windows/DOS newlines (CRLF) from Unix.**

awk '{ sub(/$/,"\r"); print }'

This one-liner also uses the sub() function. This time it replaces the zero-width anchor '$' at the end of the line with a '\r' (CR char). This substitution actually adds a CR character to the end of the line. After doing that Awk prints out the line and appends the ORS, making the line terminate with CRLF.

**23. Convert Unix newlines (LF) to Windows/DOS newlines (CRLF) from Windows/DOS.**

awk 1

This one-liner may work, or it may not. It depends on the implementation. If the implementation catches the Unix newlines in the file, then it will read the file line by line correctly and output the lines terminated with CRLF. If it does not understand Unix LF's in the file, then it will print the whole file and terminate it with CRLF (single windows newline at the end of the whole file).

Ps. Statement '1' (or anything that evaluates to true) in Awk is syntactic sugar for '{ print }'.

**24. Convert Windows/DOS newlines (CRLF) to Unix newlines (LF) from Windows/DOS**

gawk -v BINMODE="w" '1'

Theoretically this one-liner should convert CRLFs to LFs on DOS. There is a note in GNU Awk documentation that says: "Under DOS, gawk (and many other text programs) silently translates end-of-line "\r\n" to "\n" on input and "\n" to "\r\n" on output. A special "BINMODE" variable allows control over these translations and is interpreted as follows: ... If "BINMODE" is "w", then binary mode is set on write (i.e., no translations on writes)."

My tests revealed that no translation was done, so you can't rely on this BINMODE hack.

Eric suggests to better use the "tr" utility to convert CRLFs to LFs on Windows:

tr -d \r

The 'tr' program is used for translating one set of characters to another. Specifying -d option makes it delete all characters and not do any translation. In this case it's the '\r' (CR) character that gets erased from the input. Thus, CRLFs become just LFs.

**25. Delete leading whitespace (spaces and tabs) from the beginning of each line (ltrim).**

awk '{ sub(/^[ \t]+/, ""); print }'

This one-liner also uses sub() function. What it does is replace regular expression "^[ \t]+" with nothing "". The regular expression "^[ \t]+" means - match one or more space " " or a tab "\t" at the beginning "^" of the string.

**26. Delete trailing whitespace (spaces and tabs) from the end of each line (rtrim).**

awk '{ sub(/[ \t]+$/, ""); print }'

This one-liner is very similar to the previous one. It replaces regular expression "[ \t]+$" with nothing. The regular expression "[ \t]+$" means - match one or more space " " or a tab "\t" at the end "$" of the string. The "+" means "one or more".

**27. Delete both leading and trailing whitespaces from each line (trim).**

awk '{ gsub(/^[ \t]+|[ \t]+$/, ""); print }'

This one-liner uses a new function called "gsub". Gsub() does the same as sub(), except it performs as many substitutions as possible (that is, it's a global sub()). For example, given a variable f = "foo", sub("o", "x", f) would replace just one "o" in variable f with "x", making f be "fxo"; but gsub("o", "x", f) would replace both "o"s in "foo" resulting "fxx".

The one-liner combines both previous one-liners - it replaces leading whitespace "^[ \t]+" and trailing whitespace "[ \t]+$" with nothing, thus trimming the string.

To remove whitespace between fields you may use this one-liner:

awk '{ $1=$1; print }'

This is a pretty tricky one-liner. It seems to do nothing, right? Assign $1 to $1. But no, when you change a field, Awk rebuilds the $0 variable. It takes all the fields and concats them, separated by OFS (single space by default). All the whitespace between fields is gone.

**28. Insert 5 blank spaces at beginning of each line.**

awk '{ sub(/^/, " "); print }'

This one-liner substitutes the zero-length beginning of line anchor "^" with five empty spaces. As the anchor is zero-length and matches the beginning of line, the five whitespace characters get appended to beginning of the line.

**29. Align all text flush right on a 79-column width.**

awk '{ printf "%79s\n", $0 }'

This one-liner asks printf() to print the string in $0 variable and left pad it with spaces until the total length is 79 chars.

Please see the documentation of printf function for more information and examples.

**30. Center all text on a 79-character width.**

awk '{ l=length(); s=int((79-l)/2); printf "%"(s+l)"s\n", $0 }'

First this one-liner calculates the length() of the line and puts the result in variable "l". Length(var) function returns the string length of var. If the variable is not specified, it returns the length of the entire line (variable $0). Next it calculates how many white space characters to pad the line with and stores the result in variable "s". Finally it printf()s the line with appropriate number of whitespace chars.

For example, when printing a string "foo", it first calculates the length of "foo" which is 3. Next it calculates the column "foo" should appear which (79-3)/2 = 38. Finally it printf("%41", "foo"). Printf() function outputs 38 spaces and then "foo", making that string centered (38*2 + 3 = 79)

**31. Substitute (find and replace) "foo" with "bar" on each line.**

awk '{ sub(/foo/,"bar"); print }'

This one-liner is very similar to the others we have seen before. It uses the sub() function to replace "foo" with "bar". Please note that it replaces just the first match. To replace all "foo"s with "bar"s use the gsub() function:

awk '{ gsub(/foo/,"bar"); print }'

Another way is to use the gensub() function:

gawk '{ $0 = gensub(/foo/,"bar",4); print }'

This one-liner replaces only the 4th match of "foo" with "bar". It uses a never before seen gensub() function. The prototype of this function is gensub(regex, s, h[, t]). It searches the string "t" for "regex" and replaces "h"-th match with "s". If "t" is not given, $0 is assumed. Unlike sub() and gsub() it returns the modified string "t" (sub and gsub modified the string in-place).

Gensub() is a non-standard function and requires GNU Awk or Awk included in NetBSD.

In this one-liner regex = "/foo/", s = "bar", h = 4, and t = $0. It replaces the 4th instance of "foo" with "bar" and assigns the new string back to the whole line $0.

**32. Substitute "foo" with "bar" only on lines that contain "baz".**

awk '/baz/ { gsub(/foo/, "bar") }; { print }'

As I explained in the first one-liner in the first part of the article, every Awk program consists of a sequence of pattern-action statements "pattern { action statements }". Action statements are applied only to lines that match pattern.

In this one-liner the pattern is a regular expression /baz/. If line contains "baz", the action statement gsub(/foo/, "bar") is executed. And as we have learned, it substitutes all instances of "foo" with "bar". If you want to substitute just one, use the sub() function!

**33. Substitute "foo" with "bar" only on lines that do not contain "baz".**

awk '!/baz/ { gsub(/foo/, "bar") }; { print }'

This one-liner negates the pattern /baz/. It works exactly the same way as the previous one, except it operates on lines that do not contain match this pattern.

**34. Change "scarlet" or "ruby" or "puce" to "red".**

awk '{ gsub(/scarlet|ruby|puce/, "red"); print}'

This one-liner makes use of extended regular expression alternation operator | (pipe). The regular expression /scarlet|ruby|puce/ says: match "scarlet" or "ruby" or "puce". If the line matches, gsub() replaces all the matches with "red".

**35. Reverse order of lines (emulate "tac").**

awk '{ a[i++] = $0 } END { for (j=i-1; j>=0;) print a[j--] }'

This is the trickiest one-liner today. It starts by recording all the lines in the array "a". For example, if the input to this program was three lines "foo", "bar", and "baz", then the array "a" would contain the following values: a[0] = "foo", a[1] = "bar", and a[2] = "baz".

When the program has finished processing all lines, Awk executes the END { } block. The END block loops over the elements in the array "a" and prints the recorded lines. In our example with "foo", "bar", "baz" the END block does the following:

for (j = 2; j >= 0; ) print a[j--]

First it prints out j[2], then j[1] and then j[0]. The output is three separate lines "baz", "bar" and "foo". As you can see the input was reversed.

**36. Join a line ending with a backslash with the next line.**

awk '/\\$/ { sub(/\\$/,""); getline t; print $0 t; next }; 1'

This one-liner uses regular expression "/\\$/" to look for lines ending with a backslash. If the line ends with a backslash, the backslash gets removed by sub(/\\$/,"") function. Then the "getline t" function is executed. "Getline t" reads the next line from input and stores it in variable t. "Print $0 t" statement prints the original line (but with trailing backslash removed) and the newly read line (which was stored in variable t). Awk then continues with the next line. If the line does not end with a backslash, Awk just prints it out with "1".

Unfortunately this one liner fails to join more than 2 lines (this is left as an exercise to the reader to come up with a one-liner that joins arbitrary number of lines that end with backslash :)).

**37. Print and sort the login names of all users.**

awk -F ":" '{ print $1 | "sort" }' /etc/passwd

This is the first time we see the -F argument passed to Awk. This argument specifies a character, a string or a regular expression that will be used to split the line into fields ($1, $2, ...). For example, if the line is "foo-bar-baz" and -F is "-", then the line will be split into three fields: $1 = "foo", $2 = "bar" and $3 = "baz". If -F is not set to anything, the line will contain just one field $1 = "foo-bar-baz".

Specifying -F is the same as setting the FS (Field Separator) variable in the BEGIN block of Awk program:

awk -F ":" # is the same as awk 'BEGIN { FS=":" }'

/etc/passwd is a text file, that contains a list of the system's accounts, along with some useful information like login name, user ID, group ID, home directory, shell, etc. The entries in the file are separated by a colon ":".

Here is an example of a line from /etc/passwd file:

pkrumins:x:1000:100:Peteris Krumins:/home/pkrumins:/bin/bash

If we split this line on ":", the first field is the username (pkrumins in this example). The one-liner does just that - it splits the line on ":", then forks the "sort" program and feeds it all the usernames, one by one. After Awk has finished processing the input, sort program sorts the usernames and outputs them.

**38. Print the first two fields in reverse order on each line.**

awk '{ print $2, $1 }' file

This one liner is obvious. It reverses the order of fields $1 and $2. For example, if the input line is "foo bar", then after running this program the output will be "bar foo".

**39. Swap first field with second on every line.**

awk '{ temp = $1; $1 = $2; $2 = temp; print }'

This one-liner uses a temporary variable called "temp". It assigns the first field $1 to "temp", then it assigns the second field to the first field and finally it assigns "temp" to $2. This procedure swaps the first two fields on every line. For example, if the input is "foo bar baz", then the output will be "bar foo baz".

Ps. This one-liner was incorrect in Eric's awk1line.txt file. "Print" was missing.

**40. Delete the second field on each line.**

awk '{ $2 = ""; print }'

This one liner just assigns empty string to the second field. It's gone.

**41. Print the fields in reverse order on every line.**

awk '{ for (i=NF; i>0; i--) printf("%s ", $i); printf ("\n") }'

We saw the "NF" variable that stands for Number of Fields in the part one of this article. After processing each line, Awk sets the NF variable to number of fields found on that line.

This one-liner loops in reverse order starting from NF to 1 and outputs the fields one by one. It starts with field $NF, then $(NF-1), ..., $1. After that it prints a newline character.

**42. Remove duplicate, consecutive lines (emulate "uniq")**

awk 'a !~ $0; { a = $0 }'

Variables in Awk don't need to be initialized or declared before they are being used. They come into existence the first time they are used. This one-liner uses variable "a" to keep the last line seen "{ a = $0 }". Upon reading the next line, it compares if the previous line (in variable "a") is not the same as the current one "a !~ $0". If it is not the same, the expression evaluates to 1 (true), and as I explained earlier, any true expression is the same as "{ print }", so the line gets printed out. Then the program saves the current line in variable "a" again and the same process continues over and over again.

This one-liner is actually incorrect. It uses a regular expression matching operator "!~". If the previous line was something like "fooz" and the new one is "foo", then it won't get output, even though they are not duplicate lines.

Here is the correct, fixed, one-liner:

awk 'a != $0; { a = $0 }'

It compares lines line-wise and not as a regular expression.

**43. Remove duplicate, nonconsecutive lines.**

awk '!a[$0]++'

This one-liner is very idiomatic. It registers the lines seen in the associative-array "a" (arrays are always associative in Awk) and at the same time tests if it had seen the line before. If it had seen the line before, then a[line] > 0 and !a[line] == 0. Any expression that evaluates to false is a no-op, and any expression that evals to true is equal to "{ print }".

For example, suppose the input is:

foo bar foo baz

When Awk sees the first "foo", it evaluates the expression "!a["foo"]++". "a["foo"]" is false, but "!a["foo"]" is true - Awk prints out "foo". Then it increments "a["foo"]" by one with "++" post-increment operator. Array "a" now contains one value "a["foo"] == 1".

Next Awk sees "bar", it does exactly the same what it did to "foo" and prints out "bar". Array "a" now contains two values "a["foo"] == 1" and "a["bar"] == 1".

Now Awk sees the second "foo". This time "a["foo"]" is true, "!a["foo"]" is false and Awk does not print anything! Array "a" still contains two values "a["foo"] == 2" and "a["bar"] == 1".

Finally Awk sees "baz" and prints it out because "!a["baz"]" is true. Array "a" now contains three values "a["foo"] == 2" and "a["bar"] == 1" and "a["baz"] == 1".

The output:

foo bar baz

Here is another one-liner to do the same. Eric in his one-liners says it's the most efficient way to do it.

awk '!($0 in a) { a[$0]; print }'

It's basically the same as previous one, except that it uses the 'in' operator. Given an array "a", an expression "foo in a" tests if variable "foo" is in "a".

Note that an empty statement "a[$0]" creates an element in the array.

**44. Concatenate every 5 lines of input with a comma.**

awk 'ORS=NR%5?",":"\n"'

We saw the ORS variable in part one of the article. This variable gets appended after every line that gets output. In this one-liner it gets changed on every 5th line from a comma to a newline. For lines 1, 2, 3, 4 it's a comma, for line 5 it's a newline, for lines 6, 7, 8, 9 it's a comma, for line 10 a newline, etc.

## Awk one-liners explained e-book

I have written my first e-book called "Awk One-Liners Explained". I improved the explanations of the one-liners in this article series, added new one-liners and added three new chapters - introduction to awk one-liners, summary of awk special variables and idiomatic awk. Please take a look:

## Have Fun!

Have fun with these one-liners. I hope you learned something new.

If you liked this article, you may also like a very similar article on Famous Sed One-Liners Explained.

Ps. If you notice anything that you can't understand, please let me know in the comments. Thanks!

Hi all. I am starting a new article series called "**Vim Plugins You Should Know About**". This series of articles is going to be about Vim plugins that you should know about and perhaps even be using. The first article in this series will be about one of my favorite plugins called "**surround.vim**".

If you are intrigued by this article series, I suggest that you subscribe to my posts!

Don't worry, I'll finish all the previous article series that I have started (creating and marketing a video downloading tool, famous awk one-liners explained, sed one-liners explained and mit algorithms). I just wanted to utilize my excitement and get this article published ASAP.

Okay, so what is surround.vim plugin? Here is what Tim Pope, the author of this plugin, says:

Surround.vim is all about "surroundings": parentheses, brackets, quotes, XML tags, and more. The plugin provides mappings to easily delete, change and add such surroundings in pairs.

That's what the plugin does. It allows you to easily delete, change and add various "surroundings" in pairs. For example, you may quickly wrap a line in an html tag, or you may delete a pair of { }, or you may add quotes around words, etc.

Let's take a look at a few examples for each surrounding command: delete surrounding, change surrounding, and add surrounding. In the examples **|** indicates the position of the cursor.

**Examples of deleting surroundings:**

Surroundings can be deleted with the "**ds**" command. After you type "**ds**", the command expects the surrounding target you want to delete. It may be any of three quote marks, ', ", and `, the eight punctuation marks, (, ), {, }, [, ], <, > , and a special 't' target for deleting the innermost HTML tag.

Text Command New Text --------------- ------- ----------- 'Hello W|orld' ds' Hello World (12|3+4*56)/2 ds( 123+4*56/2 <div>fo|o</div> dst foo (|is the position of cursor in these examples)

See how easy it was to delete surroundings? Just 3 keystrokes. Compare it with doing it old fashioned way:

Text Command New Text --------------- ------- ----------- 'Hello W|orld' F'x,x Hello World (12|3+4*56)/2 Bxf)x 123+4*56/2 <div>fo|o</div> Bdf>wdf> foo (|is the position of cursor in these examples)

That was really messy, wasn't it? I hope you start to see now how handy surround.vim is!

**Examples of changing surroundings:**

Surroundings can be changed with the "**cs**" command. The "**cs**" command, just like "**ds**" command takes a surrounding target, and it also takes the surrounding replacement. There are a few more surrounding targets for this command. They are w for word, W for word + skip punctuation, s for sentence, and p for paragraph.

Text Command New Text --------------- ------- ----------- "Hello|world!" cs"' 'Hello world!' "Hello|world!" cs"<q> <q>Hello world!</q> (123+4|56)/2 cs)] [123+456]/2 (123+4|56)/2 cs)[ [ 123+456 ]/2 <div>foo|</div> cst<p> <p>foo</p> fo|o! csw' 'foo'! fo|o! csW' 'foo!' (|is the position of cursor in these examples)

**Examples of adding surroundings:**

Surroundings can be added with the same "**cs**" command, which takes a surrounding target, or with the "**ys**" command that takes a valid vim motion. There is a special "**yss**" command that applies a surrounding to the whole line, and "**ySS**" that applies the surrounding to the whole line, places the text on a new line and indents it.

Text Command New Text --------------- ------- ----------- Hello w|orld! ysiw) Hello (world)! Hello w|orld! csw) Hello (world)! fo|o ysiwt<html> <html>foo</html> foo quu|x baz yss" "foo quux baz" foo quu|x baz ySS" " foo quux baz " (|is the position of cursor in these examples)

You can also add surroundings while in insert mode. That is supported via <**CTRL-s**> mapping.

Please be **very careful** with **CTRL-s**. On many terminals it stops the output and your session will appear to be frozen! If that happens, use CTRL-q to unfreeze it. You may want to remove CTRL+s mapping from your terminal altogether. Add "stty stop ''" to your shell startup file (.bashrc for bash, .kshrc for ksh, etc).

For example (while in insert mode):

Command New Text ------- ------------ <CTRL-s>" "" <CTRL-s><CTRL-s><html> <html>|</html> (|is the position of cursor in these examples)

I urge you to try these mappings out. You won't become a hockey player by just looking at the game, you have to play it yourself!

For a complete reference, here are all the commands/mappings surround.vim provides:

Normal mode ----------- ds - delete a surrounding cs - change a surrounding ys - add a surrounding yS - add a surrounding and place the surrounded text on a new line + indent it yss - add a surrounding to the whole line ySs - add a surrounding to the whole line, place it on a new line + indent it ySS - same as ySs Visual mode ----------- s - in visual mode, add a surrounding S - in visual mode, add a surrounding but place text on new line + indent it Insert mode ----------- <CTRL-s> - in insert mode, add a surrounding <CTRL-s><CTRL-s> - in insert mode, add a new line + surrounding + indent <CTRL-g>s - same as <CTRL-s> <CTRL-g>S - same as <CTRL-s><CTRL-s>

## How to install surround.vim?

- 1. Download surround.zip. It will actually be a zip file containing the surround.vim script and documentation.
- 2. Extract surround.zip to ~/.vim (on Unix/Linux), or ~\vimfiles (Windows).
- 3. Restart Vim (or just source surround.vim script with ":so ~/.vim/plugin/surround.vim" on Unix or ":so ~/vimfiles/plugin/surround.vim" on Windows).
- 4. Regenerate helptags with ":helptags ~/.vim/doc" on Unix or ":helptags ~/vimfiles/doc" on Windoze.

Type ":help surround" to read help.

## Comments appreciated!

What plugins do you use? Your comments will help me and my blog readers to find more plugins which we might have not heard about. You may also share any tips and aha-moments that you have had while using vim. I might add my own and turn them into a valuable article. Thanks!

**48**Comments December 04, 2008

# MIT's Introduction to Algorithms, Lecture 16: Greedy Algorithms

This is the eleventh post in an article series about MIT's lecture course "**Introduction to Algorithms**." In this post I will review lecture sixteen, which introduces the concept of **Greedy Algorithms**, reviews **Graphs** and applies the greedy **Prim's Algorithm** to the **Minimum Spanning Tree (MST) Problem**.

The previous lecture introduced dynamic programming. Dynamic programming was used for finding solutions to optimization problems. In such problems there can be many possible solutions. Each solution has a value, and we wish to find a solution with the optimal (minimum or maximum) value. Greedy algorithms are another set of methods for finding optimal solution. A greedy algorithm always makes the choice that looks best at the moment. That is, it makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution. Greedy algorithms do not always yield optimal solutions, but for many problems they do. In this lecture it is shown that a greedy algorithm gives an optimal solution to the MST problem.

Lecture sixteen starts with a review of **graphs**. It reminds that a graph G is a pair (V, E), where V is a set of **vertices** (**nodes**) and E ⊆ VxV is a set of **edges**. If E contains ordered pairs, the graph is **directed** (also called a **digraph**), otherwise it's **undirected**. It also gives facts, such as, for all graphs, |E| = O(|V|^{2}), and if a graph is connected then |E| >= |V| - 1. It also mentions **handshaking lemma** - sum over all vertices of degree of the vertex = 2|E| (Σ_{v∈V}degree(v) = 2|E|).

The lecture continues with graph representations. Graphs are commonly stored as **adjacency matrices** or as **adjacency lists**. Adjacency matrix of a graph G = (V, E) is an nxn matrix A with elements a_{ij} = 1, if edge (i, j) ∈ E and a_{ij} = 0, if edge (i, j) ∉ E.

Here is an example of adjacency matrix for a digraph G = ({1, 2, 3, 4}, {{1, 2}, {1, 3}, {2, 3}, {4, 3}}):

You can see that the element a_{ij} of matrix A is 1 if there is an edge from i to j, and the element is 0 if there is no edge from i to j (i is row, j is column). The storage space required by this representation is O(|V|^{2}). For dense graphs this is ok, but for sparse it's not ok.

Adjacency list of a given vertex v ∈ V is the list Adj[v] of vertices adjacent to v. Here is an example of adjacency list for the same graph:

Storage required by this representation is O(V+E).

The lecture progresses with the problem of **minimum spanning trees (MSTs)**. It's stated as following: given a connected, undirected graph G = (V, E) with edge weight function w: E -> **R**, find a spanning tree that connects all the vertices of minimum weight.

Here is an example of a MST. The MST of the graph is colored in green. The MST connects all the vertices of the graph so that the weight of edges is minimum.

Analyzing the MST problem in more detail, it is appears that MST contains optimal substructure property - if we take a subgraph and look at its MST, that MST will be a part of the MST of the whole graph. This is dynamic programming hallmark #1 (see previous lecture). It also appears that the problem of finding an MST contains overlapping subproblems, which is dynamic programming hallmark #2. Hallmarks #1 and #2 suggest that we could use a dynamic programming algorithm to find a MST of a graph. Surprisingly, there is something more powerful than dynamic programming algorithm for this problem - a greedy algorithm.

There is a hallmark for greedy algorithms:

**Greedy choice property**- a locally optimal choice is globally optimal.

Lecture continues with a theorem and a proof that a greedy local choice for a MST is globally optimal choice. This is the key idea behind **Prim's algorithm** for finding a minimum spanning tree of a graph. The idea in the algorithm is to maintain a priority queue Q (keyed on minimum weight of edge connected to a vertex) of vertices that are not yet connected in the MST. At each step the algorithm takes a vertex from Q and goes over all edges, connects the edge with minimum weight to MST and updates keys of neighbor vertices (please see the lecture or lecture notes for the exact algorithm and an example).

The lecture ends with running time analysis of Prim's algorithm. As it uses a priority queue, the running time depends on implementation of the priority queue used. For example, if we use a dumb array as a priority queue, the running time is O(V^{2}), if we use a binary heap, the running time os O(E·lg(V)), if we use a Fibonacci heap, the running time is O(E + V·lg(V)).

Another well known algorithm for MSTs is **Kruskal's algorithm** (runs in O(E·lg(V)), which uses a disjoint-set data structure, that is not covered in this course (but is covered in Chapter 21 of the book this course is taught from).

The best known algorithm to date for finding MSTs is by David R. Karger (MIT), Philip N. Klein (Brown University) and Robert E. Tarjan (Princeton University). This algorithm was found in 1993, it's randomized in nature and runs in O(E + V) expected time. Here is the publication that presents this algorithm: "A Randomized Linear-Time Algorithm for Finding Minimum Spanning Trees".

You're welcome to watch lecture sixteen:

Topics covered in lecture sixteen:

- [00:10] Review of graphs.
- [00:50] Digraph - directed graph.
- [01:50] Undirected graph.
- [02:40] Bound of edges in a graph |E| = O(V
^{2}). - [04:10] Connected graph edges |E| >= |V| - 1.
- [05:50] Graph representations.
- [06:15] Adjacency matrix graph representation.
- [08:20] Example of adjacency matrix of a graph.
- [09:50] Storage required for an adjacency matrix O(V
^{2}). Dense representation. - [12:40] Adjecency list graph representation.
- [13:40] Example of adjecancy list of a graph.
- [15:50] Handshaking lemma (theorem) for undirected graphs.
- [18:20] Storage required for adj. list representation O(E + V).
- [23:30] Minimum spanning trees.
- [24:20] Definition of MST problem.
- [28:45] Example of a graph and its minimum spanning tree.
- [34:50] Optimal substructure property of MST (dynamic programming hallmark #1).
- [38:10] Theorem that a subgraph's MST is part of whole graph's MST.
- [42:40] Cut and paste argument proof.
- [45:40] Overlapping subproblems of MST (dynamic programming hallmark #2).
- [48:20] Hallmark for greedy algorithms: The greedy choice property - a locally optimal choice is globally optimal.
- [50:25] Theorem and proof that a local choice of least weight edge is globally optimal for MSTs.
- [01:01:10] Prim's algorithm for minimum spanning trees.
- [01:01:50] Key idea of Prim's algorithm.
- [01:02:55] Pseudocode of Prim's algorithm.
- [01:07:40] Example of running Prim's algorithm.
- [01:15:03] Analysis of Prim's algorithm running time.
- [01:18:20] Running time if we choose array, binary heap and Fibonaci heap.
- [01:22:40] Kruskal's algorithm.
- [01:23:00] The best algorithm for MST to date.

Lecture sixteen notes:

Have fun with greedy programming! The next post will be a trilogy of graph and shortest paths algorithms - Dijkstra's algorithm, Breadth-first search, Bellman-Ford algorithm, Floyd-Warshall algorithm and Johnson's algorithm!

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms"). Chapter 16 is called "Greedy Algorithms". It explains an activity-selection problem, elements of greedy algorithms, Huffman codes, Matroid theory, and task scheduling problem. Graphs are reviewed in Appendix B of the book. If these topics excite you, you may want to buy this awesome book:

What do you think John McCarthy would say if he saw your code? I don't think he'd like it...

John McCarthy - Programming, You Are Doing It Completely Wrong.

John McCarthy - Programming, You Are Doing It Completely Wrong.

Remember my article on Set Operations in the Unix Shell? I implemented 14 various set operations by using common Unix utilities such as diff, comm, head, tail, grep, wc and others. I decided to create a simpler version of that post that just lists the operations. I also created a **.txt cheat-sheet** version of it and to make things more interesting I added an Awk implementation of each set op. If you want a detailed explanations of each operation, go to the original article.

Download .txt right away: set operations in unix shell (.txt)

Download path: **http://www.catonmat.net/download/setops.txt**

## Set Membership

$ grep -xc 'element' set # outputs 1 if element is in set # outputs >1 if set is a multi-set # outputs 0 if element is not in set $ grep -xq 'element' set # returns 0 (true) if element is in set # returns 1 (false) if element is not in set $ awk '$0 == "element" { s=1; exit } END { exit !s }' set # returns 0 if element is in set, 1 otherwise. $ awk -v e='element' '$0 == e { s=1; exit } END { exit !s }'

## Set Equality

$ diff -q <(sort set1) <(sort set2) # returns 0 if set1 is equal to set2 # returns 1 if set1 != set2 $ diff -q <(sort set1 | uniq) <(sort set2 | uniq) # collapses multi-sets into sets and does the same as previous $ awk '{ if (!($0 in a)) c++; a[$0] } END{ exit !(c==NR/2) }' set1 set2 # returns 0 if set1 == set2 # returns 1 if set1 != set2 $ awk '{ a[$0] } END{ exit !(length(a)==NR/2) }' set1 set2 # same as previous, requires >= gnu awk 3.1.5

## Set Cardinality

$ wc -l set | cut -d' ' -f1 # outputs number of elements in set $ wc -l < set $ awk 'END { print NR }' set

## Subset Test

$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1 # outputs something if subset is not a subset of set # does not putput anything if subset is a subset of set $ awk 'NR==FNR { a[$0]; next } { if !($0 in a) exit 1 }' set subset # returns 0 if subset is a subset of set # returns 1 if subset is not a subset of set

## Set Union

$ cat set1 set2 # outputs union of set1 and set2 # assumes they are disjoint $ awk 1 set1 set2 # ditto $ cat set1 set2 ... setn # union over n sets $ cat set1 set2 | sort -u # same, but assumes they are not disjoint $ sort set1 set2 | uniq # sort -u set1 set2 $ awk '!a[$0]++' # ditto

## Set Intersection

$ comm -12 <(sort set1) <(sort set2) # outputs insersect of set1 and set2 $ grep -xF -f set1 set2 $ sort set1 set2 | uniq -d $ join <(sort -n A) <(sort -n B) $ awk 'NR==FNR { a[$0]; next } $0 in a' set1 set2

## Set Complement

$ comm -23 <(sort set1) <(sort set2) # outputs elements in set1 that are not in set2 $ grep -vxF -f set2 set1 # ditto $ sort set2 set2 set1 | uniq -u # ditto $ awk 'NR==FNR { a[$0]; next } !($0 in a)' set2 set1

## Set Symmetric Difference

$ comm -3 <(sort set1) <(sort set2) | sed 's/\t//g' # outputs elements that are in set1 or in set2 but not both $ comm -3 <(sort set1) <(sort set2) | tr -d '\t' $ sort set1 set2 | uniq -u $ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1) $ grep -vxF -f set1 set2; grep -vxF -f set2 set1 $ awk 'NR==FNR { a[$0]; next } $0 in a { delete a[$0]; next } 1; END { for (b in a) print b }' set1 set2

## Power Set

$ p() { [ $# -eq 0 ] && echo || (shift; p "$@") | while read r ; do echo -e "$1 $r\n$r"; done } $ p `cat set` # no nice awk solution, you are welcome to email me one: # peter@catonmat.net

## Set Cartesian Product

$ while read a; do while read b; do echo "$a, $b"; done < set1; done < set2 $ awk 'NR==FNR { a[$0]; next } { for (i in a) print i, $0 }' set1 set2

## Disjoint Set Test

$ comm -12 <(sort set1) <(sort set2) # does not output anything if disjoint $ awk '++seen[$0] == 2 { exit 1 }' set1 set2 # returns 0 if disjoint # returns 1 if not

## Empty Set Test

$ wc -l < set # outputs 0 if the set is empty # outputs >0 if the set is not empty $ awk '{ exit 1 }' set # returns 0 if set is empty, 1 otherwise

## Minimum

$ head -1 <(sort set) # outputs the minimum element in the set $ awk 'NR == 1 { min = $0 } $0 < min { min = $0 } END { print min }'

## Maximum

$ tail -1 <(sort set) # outputs the maximum element in the set $ awk '$0 > max { max = $0 } END { print max }'

## Have Fun!

Have fun with these ops! If you can think of other solutions, or have any tips or tricks to add, please comment on the article! Thank you!

Thanks to waldner and pgas from #awk in FreeNode. And greetings to Andreas for coming up with the cool power set function for bash!

Download "**Set Operations in Unix Shell**" Document

Download .txt document: set operations in unix shell (.txt)

Downloaded: 11017 times

Download URL: http://www.catonmat.net/download/setops.txt

Download .pdf document: set operations in unix shell (.pdf)

Downloaded: 11777 times

Download URL: http://www.catonmat.net/download/setops.pdf