Introduction to Algorithms January 27, 2009

# MIT's Introduction to Algorithms, Lectures 17, 18 and 19: Shortest Path Algorithms

<- previous article next article ->

This is the twelfth post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review a trilogy of lectures on graph and shortest path algorithms. They are lectures seventeen, eighteen and nineteen. They'll cover Dijkstra's Algorithm, Breadth-First Search Algorithm and Bellman-Ford Algorithm for finding single-source shortest paths as well as Floyd-Warshall Algorithm and Johnson's Algorithm for finding all-pairs shortest paths.

These algorithms require a thorough understanding of graphs. See the previous lecture for a good review of graphs.

Lecture seventeen focuses on the single-source shortest-paths problem: Given a graph G = (V, E), we want to find a shortest path from a given source vertex s ∈ V to each vertex v ∈ V. In this lecture the weights of edges are restricted to be positive which leads it to Dijkstra's algorithm and in a special case when all edges have unit weight to Breadth-first search algorithm.

Lecture eighteen also focuses on the same single-source shortest-paths problem, but allows edges to be negative. In this case a negative-weight cycles may exist and Dijkstra's algorithm would no longer work and would produce incorrect results. Bellman-Ford algorithm therefore is introduced that runs slower than Dijkstra's but detects negative cycles. As a corollary it is shown that Bellman-Ford solves Linear Programming problems with constraints in form xj - xi <= wij.

Lecture nineteen focuses on the all-pairs shortest-paths problem: Find a shortest path from u to v for every pair of vertices u and v. Although this problem can be solved by running a single-source algorithm once from each vertex, it can be solved faster with Floyd-Warshall algorithm or Johnson's algorithm.

## Lecture 17: Shortest Paths I: Single-Source Shortest Paths and Dijkstra's Algorithm

Lecture seventeen starts with a small review of paths and shortest paths.

It reminds that given a graph G = (V, E, w), where V is a set of vertices, E is a set of edges and w is weight function that maps edges to real-valued weights, a path p from a vertex u to a vertex v in this graph is a sequence of vertices (v0, v1, ..., vk) such that u = v0, v = vk and (vi-1, vi) ∈ E. The weight w(p) of this path is a sum of weights over all edges = w(v0, v1) + w(v1, v2) + ... + w(vk-1, vk). It also reminds that a shortest path from u to v is the path with minimum weight of all paths from u to v, and that a shortest path in a graph might not exist if it contains a negative weight cycle.

The lecture then notes that shortest paths exhibit the optimal substructure property - a subpath of a shortest path is also a shortest path. The proof of this property is given by cut and paste argument. If you remember from previous two lectures on dynamic programming and greedy algorithms, an optimal substructure property suggests that these two techniques could be applied to solve the problem efficiently. Indeed, applying the greedy idea, Dijkstra's algorithm emerges.

Here is a somewhat precise definition of single-source shortest paths problem with non-negative edge weights: Given a graph G = (V, E), and a starting vertex s ∈ V, find shortest-path weights for all vertices v ∈ V.

Here is the greedy idea of Dijkstra's algorithm:

• 1. Maintain a set S of vertices whose shortest-path from s are known (s ∈ S initially).
• 2. At each step add vertex v from the set V-S to the set S. Choose v that has minimal distance from s (be greedy).
• 3. Update the distance estimates of vertices adjacent to v.

I have also posted a video interview with Edsger Dijkstra - Edsger Dijkstra: Discipline in Thought, please take a look if you want to see how Dijkstra looked like. :)

The lecture continues with an example of running Dijkstra's algorithm on a non-trivial graph. It also introduces to a concept of a shortest path tree - a tree that is formed by edges that were last relaxed in each iteration (hard to explain in English, see lecture at 43:40).

The other half of lecture is devoted to three correctness arguments of Dijkstra's algorithm. The first one proves that relaxation never makes a mistake. The second proves that relaxation always makes the right greedy choice. And the third proves that when algorithm terminates the results are correct.

At the final minutes of lecture, running time of Dijkstra's algorithm is analyzed. Turns out that the running time depends on what data structure is used for maintaining the priority queue of the set V-S (step 2). If we use an array, the running time is O(V2), if we use binary heap, it's O(E·lg(V)) and if we use Fibonacci heap, it's O(E + V·lg(V)).

Finally a special case of weighted graphs is considered when all weights are unit weights. In this case a single-source shortest-paths problem can be solved by a the Breadth-first search (BFS) algorithm that is actually a simpler version of Dijkstra's algorithm with priority queue replaced by a FIFO! The running time of BFS is O(V+E).

You're welcome to watch lecture seventeen:

Topics covered in lecture seventeen:

• [01:40] Review of paths.
• [03:15] Edge weight functions.
• [03:30] Example of a path, its edge weights, and weight of the path.
• [04:22] Review of shortest-paths.
• [05:15] Shortest-path weight.
• [06:30] Negative edge weights.
• [10:55] Optimal substructure of a shortest path.
• [11:50] Proof of optimal substructure property: cut and paste.
• [14:23] Triangle inequality.
• [15:15] Geometric proof of triangle inequality.
• [16:30] Single-source shortest paths problem.
• [18:32] Restricted single-source shortest paths problem: all edge weights positive or zero.
• [19:35] Greedy idea for ss shortest paths.
• [26:40] Dijkstra's algorithm.
• [35:30] Example of Dijkstra's algorithm.
• [43:40] Shortest path trees.
• [45:12] Correctness of Dijkstra's algorithm: why relaxation never makes mistake.
• [53:55] Correctness of Dijkstra's algorithm: why relaxation makes progress.
• [01:01:00] Correctness of Dijkstra's algorithm: why it gives correct answer when it terminates.
• [01:15:40] Running time of Dijkstra's algorithm.
• [01:18:40] Running time depending on using array O(V^2), binary heap O(E·lg(V)) and Fibonacci heap O(E + V·lg(V)) for priority queue.
• [01:20:00] Unweighted graphs.
• [01:20:40] Breadth-First Search (BFS) algorithm.
• [01:23:23] Running time of BFS: O(V+E).

Lecture seventeen notes:

 Lecture 17, page 1 of 2: paths, shortest paths, negative edge weights, optimal substructure, triangle inequality, single-source shortest paths, dijkstra's algorithm. Lecture 17, page 2 of 2: example of dijkstra's algorithm, correctness of dijkstra's algorithm, running time of dijkstra, unweighted graphs, breadth first search (bfs).

## Lecture 18: Shortest Paths II: Bellman-Ford Algorithm

Lecture eighteen begins with recalling that if a graph contains a negative weight cycle, then a shortest path may not exist and gives a geometric illustration of this fact.

Right after this fact, it jumps to Bellman-Ford algorithm. The Bellman-Ford algorithm solves the single-source shortest-paths problem in the general case in which edge weights may be negative. Given a weighted, directed graph G = (V, E) with source s and weight function w: E → R, the Bellman-Ford algorithm produces the shortest paths from s and their weights, if there is no negative weight cycle, and it produces no answer if there is a negative weight cycle.

The algorithm uses relaxation, progressively decreasing an estimate on the weight of a shortest path from the source s to each vertex v ∈ V until it achieves the actual shortest-path weight.

The running time of Bellman-Ford algorithm is O(VE). The lecture also gives a correctness proof of Bellman-Ford algorithm.

The other half of the lecture is devoted to a problem that can be effectively solved by Bellman-Ford. It's called the linear feasibility problem that it is a special case of linear programming (LP) problem, where there is no objective but the constraints are in form xi <= wij. It is noted that Bellman-Ford is actually a simple case of LP problem.

The lecture ends with an application of Bellman-Ford to solving a special case of VLSI layout problem in 1 dimension.

You're welcome to watch lecture eighteen:

Topics covered in lecture eighteen:

• [00:20] A long, long time ago... in a galaxy far, far away...
• [00:40] Quick review of previous lecture - Dijkstra's algorithm, non-negative edge weights.
• [01:40] Description of Bellman-Ford algorithm.
• [04:30] Bellman-Ford algorithm.
• [08:50] Running time of Bellman-Ford O(VE).
• [10:05] Example of Bellmen-Ford algorithm.
• [18:40] Correctness of Bellman-Ford algorithm:
• [36:30] Linear programming (LP).
• [42:48] Efficient algorithms for solving LPs: simplex algorithm (exponential in worst case, but practical), ellipsoid algorithm (polynomial time, impractical), interior point methods (polynomial), random sampling (brand new, discovered at MIT).
• [45:58] Linear feasibility problem - LP with no objective.
• [47:30] Difference constraints - constraints in form xj - xi <= wij.
• [49:50] Example of difference constraints.
• [51:04] Constraint graph.
• [54:05] Theorem: Negative weight cycle in constraint means difference constraints are infeasible/unsatisfiable.
• [54:50] Proof.
• [59:15] Theorem: If no negative weight cycle then satisfiable.
• [01:00:23] Proof.
• [01:08:20] Corollary: Bellman-Ford solves a system of of m difference constraints on n variables in O(mn) time.
• [01:12:30] VLSI Layout problem solved by Bellman-Ford.

Lecture eighteen notes:

 Lecture 18, page 1 of 2: bellman-ford algorithm, correctness of bellman-ford, linear programming. Lecture 18, page 2 of 2: feasibility problem, bellman-ford and difference constraints, vlsi layouts.

## Lecture 19: Shortest Paths III: All-Pairs Shortest Paths and Floyd-Warshall Algorithm

Lecture nineteen starts with a quick review of lectures seventeen and eighteen. It reminds the running times of various single-source shortest path algorithms, and mentions that in case of a directed acyclic graphs (which was not covered in previous lectures), you can run topological sort and 1 round of Bellman-Ford that makes it find single-source shortest paths in linear time (for graphs) in O(V+E).

The lecture continues all-pairs shortest paths problem, where we want to know the shortest path between every pair of vertices.

A naive approach to this problem is run single-source shortest path from each vertex. For example, on an unweighted graph we'd run BFS algorithm |V| times that would give O(VE) running time. On a non-negative edge weight graph it would be |V| times Dijkstra's algorithm, giving O(VE + V2lg(V)) time. And in general case we'd run Bellman-Ford |V| times that would make the algorithm run in O(V2E) time.

The lecture continues with a precise definition of all-pairs shortest paths problem: given a directed graph, find an NxN matrix (N = |V|), where each entry aij is the shortest path from vertex i to vertex j.

In general case, if the graph has negative edges and it's dense, the best we can do so far is run Bellman-Ford |V| times. Recalling that E = O(V2) in a dense graph, the running time is O(V2E) = O(V4) - hypercubed in number of vertices = slow.

Lecture then proceeds with a dynamic programming algorithm without knowing if it will be faster or not. It's too complicated to explain here, and I recommend watching lecture at 11:54 to understand it. Turns out this dynamic programming algorithm does not give a performance boost and is still O(V4), but it gives some wicked ideas.

The most wicked idea is to connect matrix multiplication with the dynamic programming recurrence and using repeated squaring to beat O(V4). This craziness gives O(V3lgV) time that is an improvement. Please see 23:40 in the lecture for full explanation.

After all this the lecture arrives at Floyd-Warshall algorithm that finds all-pairs shortest paths in O(V3). The algorithm is derived from a recurrence that the shortest path from vertex i to vertex j is minimum of { shortest path from i to j directly or shortest path from i to k and shortest path from k to j }.

Finally the lecture explains Johnson's algorithm that runs in O(VE + V2log(V)) time for sparse graphs. The key idea in this algorithm is to reweigh all edges so that they are all positive, then run Dijkstra from each vertex and finally undo the reweighing.

It turns out, however, that to find the function for reweighing all edges, a set of difference constraints need to be satisfied. It makes us first run Bellman-Ford to solve these constraints.

Reweighing takes O(EV) time, running Dijkstra on each vertex takes O(VE + V2lgV) and undoing reweighing takes O(V2) time. Of these terms O(VE + V2lgV) dominates and defines algorithm's running time (for dense it's still O(V3).

You're welcome to watch lecture nineteen:

Topics covered in lecture nineteen:

• [01:00] Review of single-source shortest path algorithms.
• [04:45] All-pairs shortest paths by running single-source shortest path algorithms from each vertex.
• [05:35] Unweighted edges: |V|xBFS = O(VE).
• [06:35] Nonnegative edge weights: |V|xDijkstra = O(VE + V2lg(V)).
• [07:40] General case: |V|xBellman-Ford = O(V2E).
• [09:10] Formal definition of all-pairs shortest paths problem.
• [11:08] |V|xBellman-Ford for dense graphs (E = O(V2)) is O(V4).
• [11:54] Trying to beat O(V4) with dynamic programming.
• [19:30] Dynamic programming algorithm, still O(V4).
• [23:40] A better algorithm via wicked analogy with matrix multiplication. Running time: O(V3lg(V)).
• [37:45] Floyd-Warshall algorithm. Runs in O(V3).
• [47:35] Transitive closure problem of directed graphs.
• [53:30] Johnson's algorithm. Runs in O(VE + V2log(V))

Lecture nineteen notes:

 Lecture 19, page 1 of 2: all-pairs shortest paths, dynamic programming algorithm, matrix multiplication analogy. Lecture 19, page 2 of 2: floyd-warshall algorithm, transitive closure problem, johnson's algorithn.

Have fun with shortest path algorithms! The next post is going to be an introduction to parallel algorithms - things like dynamic multithreading, scheduling and multithreaded algorithms.

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms"). Chapters 24 and 25, called "Single-Source Shortest Paths" and "All-Pairs Shortest Paths" explain everything I wrote about here in much more detail. If these topics excite you, you may want to buy this awesome book:

Unix Shell January 14, 2009

# Sed One-Liners Explained, Part III: Selective Deletion of Certain Lines and Special Applications

<- previous article next article ->

This is the third and final part of an article on the sed one-liners. This part will explain sed one-liners for selective deletion of certain lines and special applications of sed. See part one for introduction of the series.

Similarly to Awk one-liners, sed one-liners are short and concise sed programs that span less than one terminal line. They were written by Eric Pement and are floating around on the Internet as 'sed1line.txt' file.

If you are intrigued by this article series, I suggest that you subscribe to my posts!

Eric's sed one-liners are divided into several sections:

Grab my "sed cheat sheet" and a local copy of sed1line.txt file, and let the explanation begin!

Awesome news: I have written an e-book based on this article series. Check it out:

## 5. Selective Deletion of Certain Lines

68. Print all lines in the file except a section between two regular expressions.

`sed '/Iowa/,/Montana/d'`

This one-liner continues where the previous left off. One-liner #67 used the range match "/start/,/finish/" to print lines between two regular expressions (inclusive). This one-liner, on the other hand, deletes lines between two regular expressions and prints all the lines outside this range. Just to remind you, a range "/start/,/finish/" matches all lines starting from the first line that matches a regular expression "/start/" to the first line that matches a regular expression "/finish/". In this particular one-liner the "d", delete, command is applied to these lines. The delete command prevents the matching lines from ever seeing the light.

For example, suppose your input to this one-liner was:

```Florida
<strong>Iowa
New York
San Jose
Montana</strong>
Texas
Fairbanks
```

Then after the sed program has finished running, the output is:

```Florida
Texas
Fairbanks
```

We see this output because the lines from Iowa to Montana matched the "/Iowa/,/Montana/" range match (i put the matched lines in bold) and were deleted.

69. Delete duplicate, consecutive lines from a file (emulates "uniq").

`sed '\$!N; /^\(.*\)\n\1\$/!P; D'`

This one-liner acts as the "uniq" Unix utility. So how does it work? First of all, for every line that is not the very last line of input, sed appends the next line to the pattern space by the "N" command. The "N" command is restricted to all but the last line by "\$!" restriction pattern. The newly appended line is separated from the previous line by the "\n" character. Next, the pattern space is matched against "/^\(.*\)\n\1\$/" regular expression. This regular expression captures the previous line up to "\n" character and saves it in the match group "\1". Then it tests if the newly appended line is the same as the previous one. If it is not, the "P" gets executed. If it is, the "P" command does not get executed. The "P" command prints everything in the pattern space up to the first "\n" character. Next the "D" command executes and deletes everything up to the first "\n" char, leaving only the newly read line in pattern space. It also forces the sed script to begin from the first command.

This way it loops over all lines, comparing two consecutive lines. If they are equal, the first line gets deleted, and a new line gets appended to what's left. If they are not equal, the first one gets deleted, and deleted.

I think it's hard to understand what is going on from this description. I'll illustrate it with an example. Suppose this is the input:

```foo
foo
foo
bar
bar
baz
```

The first thing sed does is it reads the first line of input in pattern space. The pattern space now contains "foo". Now the "N" command executed. The pattern space now contains "foo\nfoo". Next the pattern space is tested against "/^\(.*\)\n\1\$/" regular expression. This regular expression matches because "\(.*\)" is "foo" and "/^\(.*\)\n\1\$/" is "foo\nfoo", exactly what we have in the pattern space. As it matched, the "P" command does not get executed. Now the "D" command executes, deleting the everything up to first "\n" from pattern space. The pattern space now contains just "foo". The "D" command forces sed to start from the first command. Now the "N" is executed again, the pattern space now contains "foo\nfoo" again and the same thing happens, "P" does not get executed and "D" deletes the first "foo", leaving the pattern space with just "foo" in it. Now the "N" gets executed once again, this time "bar" gets appended to pattern space. It contains "foo\nbar" now. The regular expression "/^\(.*\)\n\1\$/" does not match and "P" gets executed, printing "foo". After that "D" gets executed wiping "foo" from pattern space. The pattern space now contains "bar". The commands restart and "N" gets executed, it appends the next "bar" to current pattern space. Now it contains "bar\nbar". Just like with "foo\nfoo", nothing gets printed, and "D" deletes the first "bar", leaving pattern space with "bar". The one-liner restarts its execution. Now "N" reads in the final line "baz". The pattern space contains "bar\nbaz" which does not match the regular expression. The "P" prints out the "bar" and "D" deletes "bar". Now "N" does not get executed because we are at the last line of input. The "\$!N" restricts "N" to all lines but last. At this moment pattern space contains only the last "baz", the regular expression does not match, so "baz" gets printed. The "D" command executes, emptying the pattern space. There is no more input and sed quits.

The output for this example is:

```foo
bar
baz
```

I think this is one of the most detailed explanations I have written about a single one liner. :)

70. Delete duplicate, nonconsecutive lines from a file.

`sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'`

This is a very tricky one-liner. It stores the unique lines in hold buffer and at each newly read line, tests if the new line already is in the hold buffer. If it is, then the new line is purged. If it's not, then it's saved in hold buffer for future tests and printed.

A more detailed description - at each line this one-liner appends the contents of hold buffer to pattern space with "G" command. The appended string gets separated from the existing contents of pattern space by "\n" character. Next, a substitution is made to that substitutes the "\n" character with two "\n\n". The substitute command "s/\n/&&/" does that. The "&" means the matched string. As the matched string was "\n", then "&&" is two copies of it "\n\n". Next, a test "/^\([ -~]*\n\).*\n\1/" is done to see if the contents of group capture group 1 is repeated. The capture group 1 is all the characters from space " " to "~" (which include all printable chars). The "[ -~]*" matches that. Replacing one "\n" with two was the key idea here. As "\([ -~]*\n\)" is greedy (matches as much as possible), the double newline makes sure that it matches as little text as possible. If the test is successful, the current input line was already seen and "d" purges the whole pattern space and starts script execution from the beginning. If the test was not successful, the doubled "\n\n" gets replaced with a single "\n" by "s/\n//" command. Then "h" copies the whole string to hold buffer, and "P" prints the new line.

71. Delete all lines except duplicate consecutive lines (emulates "uniq -d").

`sed '\$!N; s/^\(.*\)\n\1\$/\1/; t; D'`

This sed one-liner prints only the duplicate lines. This sed one-liner starts with reading in the next line from input with the "N" command. As I already mentioned, the current line and the next get separated by "\n" character after "N" executes. This one-liner also restrics "N" to all lines but last with "\$!" restriction. Now a substitution "s/^\(.*\)\n\1\$/\1/" is tried. Similarly to one-liner #69, this substitution replaces two repeating strings with one. For example, a string "foo\nfoo" gets replaced with just "foo". Now, if this substitution was successful (there was a repeated string), the "t" command takes the script to the end where the current pattern space gets printed automatically. If the substitution was not successful, "D" executes, deleting the non-repeated string. The cycle continues and this way only the duplicate lines get printed once.

Let's take a look at an example. Suppose the input is:

```foo
foo
bar
baz
```

This one-liner reads the first line and immediately executes the "N" command. The pattern space now is "foo\nfoo". The substitution "s/^\(.*\)\n\1\$/\1/" is tried and it's successful, because "foo" is repeated twice. The pattern space now contains just a single "foo". As the substitution was successful, "t" command branches to the end of the script. At this moment "foo" gets printed. Now the cycle repeats. Sed reads in "bar", the "N" command appends "baz" to "bar". The pattern space now is "bar\nbaz". The substitution is tried, but it's not successful, as "bar" is not repeated. As the substitution failed, "t" does nothing and "D" executes, deleting "bar" from pattern space. The pattern space is left with single "baz". Command "N" no longer executes as we reached end of file, substitution fails, "t" fails, and "D" deletes the "baz".

The end result is:

```foo
```

Just as we expected - only the duplicate line got printed.

72. Delete the first 10 lines of a file.

`sed '1,10d'`

This one-liner restricts the "d" command to a range of lines by number. The "1,10" means a range matching lines 1 to 10 inclusive. On each of the lines the "d" command gets executed. It deletes the current pattern space, and restarts the commands from beginning. The default action for lines > 10 is to print the line.

73. Delete the last line of a file.

`sed '\$d'`

This one-liner restricts the "d" command to the last line of file. It's done by specifying the special char "\$" as the line to match. It matches only the last line. The last line gets deleted, but the others get printed implicitly.

74. Delete the last 2 lines of a file.

`sed 'N;\$!P;\$!D;\$d'`

This one-liner always keeps two lines in the pattern space. At the very last line, it just does not output these last two. All the others before last two get output implicitly. Let's see how it does it. As soon as sed reads the first line of input in pattern space, it executes the first command "N". It places the 2nd line of input in pattern space. The next two commands "\$!P" and "\$!D" print the first part of pattern space up to newline character, and delete this part from pattern space. They keep doing it until the very last line gets appended to pattern space by "N" command. At this moment the last two lines are in pattern space and "\$d" executes, deleting them both. That's it. Last two lines got deleted.

If there is just one line of data, then it outputs it.

75. Delete the last 10 lines of a file.

`sed -e :a -e '\$d;N;2,10ba' -e 'P;D'`

This is really straight forward one-liner. It always keeps 10 lines in pattern-space, by appending each new input line with "N", and deleting the 11th excessive line with "D". Once the end of file is reached, it "d" the whole pattern space, deleting the last 10 lines.

`sed -n -e :a -e '1,10!{P;N;D;};N;ba'`

This is also a straight forward one-liner. For the lines that are not 1-10, it appends them to pattern space with "N". For lines > 10, it prints the first line in pattern space with "P", appends another line with "N" and deletes the printed line with "D". The "D" command causes sed to branch to the beginning of script! The "N;ba" at the end never, ever gets executed again for lines > 10. It keeps looping this way "P", "N", "D", always keeping 10 lines in pattern space and printing line-10 on each cycle. The "N" command causes script to quit if it tries to read past end of file.

76. Delete every 8th line.

`gsed '0~8d'`

This one-liner only works with GNU Sed only. It uses a special address range match "first~step" that matches every step'th line starting with the first. In this one-liner first is 0 and step is 8. Zero is not a valid physical line number, so the very first line of input does not match. The first line to match is 8th, then 16th, then 24th, etc. Each line that matches is deleted by "d" command.

`sed 'n;n;n;n;n;n;n;d;'`

This is a portable version. The "n" command prints the current pattern space, empties it, and reads in the next line. It does so for every 7 lines, and 8th line gets deleted with "d". This process continues until all input has been processed.

77. Delete lines that match regular expression pattern.

`sed '/pattern/d'`

This one-liner executes the "d" command on all lines that match "/pattern/". The "d" command deletes the line and skips to the next line.

78. Delete all blank lines in a file (emulates "grep '.'".

`sed '/^\$/d'`

The regular expression "/^\$/" in this one-liner tests if the beginning of line matches the end of the line. Only the empty lines have this property and sed deletes them.

Another way to do the same is:

`sed '/./!d'`

This one-liner tests if the line matches at least one character. The dot "." in the regular expression matches any character. An empty line does not have any characters and it does not match this regular expression. Sed deletes all the lines that do not match this regular expression.

79. Delete all consecutive blank lines from a file (emulates "cat -s").

`sed '/./,/^\$/!d'`

This one-liner leaves one blank line at the end of the file, if there are multiple blanks at the end. Other than that, all consecutive blanks are stripped.

It uses an inverse range match "/start/,/finish/!" to "d" delete lines from first blank line, to first non-blank, non-inclusive.

`sed '/^\$/N;/\n\$/D'`

This one-liner leaves one blank line at the beginning and end of the file, if there are multiple blanks at both sides. Other than that, all consecutive blanks are stripped.

The consecutive empty lines get appended in pattern space by "/^\$/N" command. The "/\n\$/D" command matches and deletes blanks until only 1 is left. At that moment it no longer matches, and the line is output.

80. Delete all consecutive blank lines from a file except the first two.

`sed '/^\$/N;/\n\$/N;//D'`

In case of > 2 blank lines, this one-liner trims them down to two. There is a catch to this one-liner. Let me explain it first. See the last command "//D"? It's a shortcut for "/previous-match/D". In this case it's shortcut for "/\n\$/D". Alright, now the one-liner itself. On every empty line, it appends the next to current pattern space with "/^\$/N" command. Next it tests if the line just read in was actually a blank line with "/\n\$/", if it is, it reads another line in with "N". At this moment it repeats the same test "/\n\$/". If the line was a blank one again, it deletes the first blank line and restarts sed script from the beginning. Notice that at all times only 2 consecutive blank lines are in pattern space. This way any number of blank lines get deleted and only two are left.

81. Delete all leading blank lines at the top of a file.

`sed '/./,\$!d'`

This one-liner inverts a match "match from the first non-blank line to end of file". It becomes "match from the beginning of file to last blank line".

82. Delete all trailing blank lines at the end of a file.

`sed -e :a -e '/^\n*\$/{\$d;N;ba' -e '}'`

This one-liner accumulates blank lines in pattern space until it either hits end or hits a non-blank line. If it hits end, "\$d" deletes the whole pattern space (which contained just the trailing blank lines) and quits. If however, it hits non-blank line, the whole pattern space gets printed implicitly and script continues as if nothing had happened.

This one is a portable version.

`gsed -e :a -e '/^\n*\$/N;/\n\$/ba'`

This is the same script, except a shorter version, made to work with Gnu Sed.

83. Delete the last line of each paragraph.

`sed -n '/^\$/{p;h;};/./{x;/./p;}'`

This one-liner always keeps the previous line in hold buffer. It's accomplished by 2nd block of commands "/./{x;/./p;}". In this block, the pattern space (1 line) gets exchanged with hold buffer (1 line) by "x" command and if the hold buffer was not empty, it gets printed by "p". The next moment to note is what happens on the first empty line. That is the line after the paragraph. At this moment "/^\$/{p;h;}" gets executed, that prints the blank line (but does not print the last line of paragraph!), and puts the blank line in hold buffer. Once a new paragraph is reached, the script executed just like it was the very first paragraph of the input.

## 6. Special Sed Applications

84. Remove nroff overstrikes.

Nroff overstrikes are chars that are formatted to stand out in bold. They are achieved like in old typewriters, where you would do backspace and hit the same key again. In nroff it's key CHAR, CTRL+H, CHAR. This one-liner deletes the CHAR, CTRL+H, leaving just plain CHAR.

`sed 's/.^H//g'`

Press Ctrl+V and then Ctrl+H to insert ^H literally in sed one-liner. It then uses the substitute command to delete any char "." followed by CTRL+H "^H".

Another way to do the same is use a hex escape expression that works in most recent seds:

`sed 's/.\x08//g'`

Yet another way is to use "echo" and enable interpretation of backslashed characters:

`sed 's/.'`echo -e "\b"`'//g'`

`gsed -r '/^\r?\$/q'`

Usenet, HTTP and Email headers are similar. They are a bunch of text lines, separated from the body of the message with two new lines "\r\n\r\n". Some implementations might even go with just "\n\n". This one-liner quits on the first line that is either empty or contains "\r". In other words, it prints the message header and quits.

86. Print Usenet/HTTP/Email message body.

`sed '1,/^\$/d'`

This one-liner uses a range match "1,/^\$/" to delete lines starting from 1st, and ending with the first blank line (inclusive). As I explained in the previous one-liner #78 above, "/^\$/" matches empty lines. All the lines before first blank line in a Usenet/Email message or a HTTP header are message headers. They get deleted.

87. Extract subject from an email message.

`sed '/^Subject: */!d; s///; q'`

This one-liner deletes all lines that do not match "^Subject: ". Then it re-uses the match in "s///" to delete "Subject: " part from the line, leaving just the real subject. Please notice how "s///" is equivalent to "s/previous-match//", where "previous-match" is "^Subject: *" in this one-liner.

88. Extract sender information from an email message.

`sed '/^From: */!d; s///; q'`

This one liner is equivalent to the previous one, except it prints sender information from email.

89. Extract email address from a "Name Surname <email@domain.com>" string.

`sed 's/.*< *//;s/ *>.*//;`

This one-liner strips all symbols before < symbol (and any whitespace after it), and stips all symbols after > symbol (including whitespace before it). That's it. What's left is email@domain.com.

90. Add a leading angle bracket and space to each line (quote an email message).

`sed 's/^/> /'`

This one-liner substitutes zero-width anchor "^" that matches beginning of line with "> ". As it's a zero-width anchor, the result is that "> " gets added to beginning of each line.

91. Delete leading angle bracket from each line (unquote an email message).

`sed 's/^> //'`

It does what it says, deletes two characters ">" and a space " " from the beginning of each line.

92. Strip HTML tags.

`sed -e :a -e 's/<[^>]*>//g;/</N;//ba'`

Sed is not made for parsing HTML. This is a very crude version of HTML tag eraser. It starts by creating a branch label named "a". Then on each line it substitutes "<[^>]*>" with nothing as many times as possible ("g" flag for s/// command). The "<[^>]*>" expression means match match symbol "<" followed by any other symbols that are not ">", and that ends with ">". This is a common pattern in regular expressions for non-greediness. Next, the one-liner tests if there are any open tags left on the line, if there are "N" reads the next line of input to make it work across multiple lines. "//ba" finally branches to the beginning of the script (it's short for "/previous-expression/ba" which in this case is "/</ba").

## Sed One-Liners Explained E-Book

I have written an e-book called "Sed One-Liners Explained". I improved the explanations of the one-liners in this article series, added new one-liners and added three new chapters - an introduction to sed, a summary of sed addresses and ranges, and debugging sed scripts with sed-sed. Please take a look:

## Have Fun!

This post completes the three part article on the superman of Unix stream editing. I hope that you learned a thing or two and it was my pleasure explaining them.

Just as with the three part article on Awk one-liners explained, my plans are to create a 'sed1line-explained.txt' file that will be supplementary to 'sed1line.txt', and publish an ebook in pdf format.

Howto January 07, 2009

# How to Save Time by Watching Videos at Higher Playback Speeds

This is going to be a small, technical tutorial on how to save a lot of time by watching videos at higher playback rates.

I first read about this idea from my most favorite personal development blog at Steve Pavlina.com. In his post "Overclock Your Audio Learning" he says that he occasionally listens to audios at 4.1x. At this speed 4 hour video/audio can be listened in less 1 hour!

I personally found it impossible to understand anything at 4x speed. My optimal listening speed is 1.65x - 2.1x.

To speed up the videos you will first need to download and install AviSynth. AviSynth is kind of a video programming language with which you can do all kinds of manipulations to videos programmatically. If you are on Windows, then during the installation make sure to associate .avs file extension with Windows Media Player and not Notepad.

Next, create this AviSynth script, and place it in the same directory as your video. Name the script as "speedup.avs" or something similar. Make sure the extension is ".avs" if you are on Windows!

```file = "file_name_of_video.avi"
speedup = 1.65
pitch = 100

DirectShowSource(file)

audio_rate = last.audiorate
video_rate = last.framerate

AssumeSampleRate(int(audio_rate*speedup))
AssumeFPS(video_rate*speedup)
TimeStretch(pitch = pitch)
```

There are three variables that you can change in this simple script. The first is "file". It should be the filename of the video you are about to watch. The next is "speedup". It's the new playback rate, you may set it to any value you wish. For example, if you set it to 2.0, then the video will play twice as fast as it normally would. And the last parameter to change is the "pitch". You may change it to something lower than 100 when the video plays at higher speeds to make the speaker sound lower. I usually keep "speedup" at 1.65 and "pitch" at 75.

Once you have made your own configuration, just double click the .avs on Windows to play it at the new playback speed, or play it through mplayer on Linux!

Update: My blog readers bobb and crb suggested two new techniques on how to watch videos faster. Bobb suggested just to use mplayer. Turns out it already has an option to play videos faster!

```mplayer -speed 1.65 file.avi

# use keys [ ], and { } to control the playback speed
# use backspace to reset video speed to normal.
```

Crb suggested to use MySpeed™ Plug-In for YouTube to speed up video on YouTube in real time.

Here are a few examples that I crafted at speeds 1x, 1.4x, 1.65x, 2x, 2.25x, 2.5x, 2.75x and 3x. They are from Alan Kay's keynote speech "The computer revolution hasn't happend yet" at OOPSLA Conference in 1997.

After a few minutes of listening at higher speeds, try going back to listen at 1x. It will seem incredibly slow because your mind will have adapted to the faster input rate.

Alan Kay's talk at a normal speed. Total time: 5 mins.

Alan Kay's talk at 1.4x speed. Total time: 3:34 mins.

Alan Kay's talk at 1.65x speed. Total time: 3 mins.

Alan Kay's talk at 2x speed. Total time: 2:30 mins.

Alan Kay's talk at 2.25x speed. Total time: 2:13 mins.

Alan Kay's talk at 2.5x speed. Total time: 2 mins.

Alan Kay's talk at 2.75x speed. Total time: 1:49 mins.

Alan Kay's talk at 3x speed. Total time: 1:40 mins.

Do you have any other techniques to speed up videos? I am also curious at what speeds do you feel the most comfortable watching the videos?

It would also be cool to create a hack that modifies youtube and google video players to make them play videos faster natively.

Ps. Did you know that I was a big fan of video lectures? I have been collecting math, physics, computer science video lectures for almost 3 years now and posting them to my other Free Science Online Blog.

Awk Programming January 05, 2009

# Awk One-Liners Explained, Part III: Selective Printing and Deleting of Certain Lines

<- previous article next article ->

This is the third and final part of a three-part article on the Awk one-liners. This part will explain Awk one-liners for selective printing and deletion of certain lines. See part one for introduction of the series.

If you just came to my website, then you might wonder, "What are these Awk one-liners and why are they famous?" The answer is very simple - they are small and beautiful Awk programs that do one and only text manipulation task very well. They have been circulating around the Internet as awk1line.txt text file and they have been written by Eric Pement.

If you are intrigued by this article series, I suggest that you subscribe to my posts, as I will have a lot more interesting and educational articles this year.

Eric Pement's Awk one-liner collection consists of five sections:

Awesome news: I have written an e-book based on this article series. Check it out:

Grab my Awk cheat sheet and the local copy of Awk one-liners file awk1line.txt and let's roll.

## 4. Selective Printing of Certain Lines

45. Print the first 10 lines of a file (emulates "head -10").

`awk 'NR < 11'`

Awk has a special variable called "NR" that stands for "Number of Lines seen so far in the current file". After reading each line, Awk increments this variable by one. So for the first line it's 1, for the second line 2, ..., etc. As I explained in the very first one-liner, every Awk program consists of a sequence of pattern-action statements "pattern { action statements }". The "action statements" part get executed only on those lines that match "pattern" (pattern evaluates to true). In this one-liner the pattern is "NR < 11" and there are no "action statements". The default action in case of missing "action statements" is to print the line as-is (it's equivalent to "{ print \$0 }"). The pattern in this one-liner is an expression that tests if the current line number is less than 11. If the line number is less than 11, Awk prints the line. As soon as the line number is 11 or more, the pattern evaluates to false and Awk skips the line.

A much better way to do the same is to quit after seeing the first 10 lines (otherwise we are looping over lines > 10 and doing nothing):

`awk '1; NR == 10 { exit }'`

The "NR == 10 { exit }" part guarantees that as soon as the line number 10 is reached, Awk quits. For lines smaller than 10, Awk evaluates "1" that is always a true-statement. And as we just learned, true statements without the "action statements" part are equal to "{ print \$0 }" that just prints the first ten lines!

46. Print the first line of a file (emulates "head -1").

`awk 'NR > 1 { exit }; 1'`

This one-liner is very similar to previous one. The "NR > 1" is true only for lines greater than one, so it does not get executed on the first line. On the first line only the "1", the true statement, gets executed. It makes Awk print the line and read the next line. Now the "NR" variable is 2, and "NR > 1" is true. At this moment "{ exit }" gets executed and Awk quits. That's it. Awk printed just the first line of the file.

47. Print the last 2 lines of a file (emulates "tail -2").

`awk '{ y=x "\n" \$0; x=\$0 }; END { print y }'`

Okay, so what does this one do? First of all, notice that "{y=x "\n" \$0; x=\$0}" action statement group is missing the pattern. When the pattern is missing, Awk executes the statement group for all lines. For the first line, it sets variable "y" to "\nline1" (because x is not yet defined). For the second line it sets variable "y" to "line1\nline2". For the third line it sets variable "y" to "line2\nline3". As you can see, for line N it sets the variable "y" to "lineN-1\nlineN". Finally, when it reaches EOF, variable "y" contains the last two lines and they get printed via "print y" statement.

Thinking about this one-liner for a second one concludes that it is very ineffective - it reads the whole file line by line just to print out the last two lines! Unfortunately there is no seek() statement in Awk, so you can't seek to the end-2 lines in the file (that's what tail does). It's recommended to use "tail -2" to print the last 2 lines of a file.

48. Print the last line of a file (emulates "tail -1").

`awk 'END { print }'`

This one-liner may or may not work. It relies on an assumption that the "\$0" variable that contains the entire line does not get reset after the input has been exhausted. The special "END" pattern gets executed after the input has been exhausted (or "exit" called). In this one-liner the "print" statement is supposed to print "\$0" at EOF, which may or may not have been reset.

It depends on your awk program's version and implementation, if it will work. Works with GNU Awk for example, but doesn't seem to work with nawk or xpg4/bin/awk.

The most compatible way to print the last line is:

`awk '{ rec=\$0 } END{ print rec }'`

Just like the previous one-liner, it's computationally expensive to print the last line of the file this way, and "tail -1" should be the preferred way.

49. Print only the lines that match a regular expression "/regex/" (emulates "grep").

`awk '/regex/'`

This one-liner uses a regular expression "/regex/" as a pattern. If the current line matches the regex, it evaluates to true, and Awk prints the line (remember that missing action statement is equal to "{ print }" that prints the whole line).

50. Print only the lines that do not match a regular expression "/regex/" (emulates "grep -v").

`awk '!/regex/'`

Pattern matching expressions can be negated by appending "!" in front of them. If they were to evaluate to true, appending "!" in front makes them evaluate to false, and the other way around. This one-liner inverts the regex match of the previous (#49) one-liner and prints all the lines that do not match the regular expression "/regex/".

51. Print the line immediately before a line that matches "/regex/" (but not the line that matches itself).

`awk '/regex/ { print x }; { x=\$0 }'`

This one-liner always saves the current line in the variable "x". When it reads in the next line, the previous line is still available in the "x" variable. If that line matches "/regex/", it prints out the variable x, and as a result, the previous line gets printed.

It does not work, if the first line of the file matches "/regex/", in that case, we might want to print "match on line 1", for example:

`awk '/regex/ { print (x=="" ? "match on line 1" : x) }; { x=\$0 }'`

This one-liner tests if variable "x" contains something. The only time that x is empty is at very first line. In that case "match on line 1" gets printed. Otherwise variable "x" gets printed (that as we found out contains the previous line). Notice that this one-liner uses a ternary operator "foo?bar:baz" that is short for "if foo, then bar, else baz".

52. Print the line immediately after a line that matches "/regex/" (but not the line that matches itself).

`awk '/regex/ { getline; print }'`

This one-liner calls the "getline" function on all the lines that match "/regex/". This function sets \$0 to the next line (and also updates NF, NR, FNR variables). The "print" statement then prints this next line. As a result, only the line after a line matching "/regex/" gets printed.

If it is the last line that matches "/regex/", then "getline" actually returns error and does not set \$0. In this case the last line gets printed itself.

53. Print lines that match any of "AAA" or "BBB", or "CCC".

`awk '/AAA|BBB|CCC/'`

This one-liner uses a feature of extended regular expressions that support the | or alternation meta-character. This meta-character separates "AAA" from "BBB", and from "CCC", and tries to match them separately on each line. Only the lines that contain one (or more) of them get matched and printed.

54. Print lines that contain "AAA" and "BBB", and "CCC" in this order.

`awk '/AAA.*BBB.*CCC/'`

This one-liner uses a regular expression "AAA.*BBB.*CCC" to print lines. This regular expression says, "match lines containing AAA followed by any text, followed by BBB, followed by any text, followed by CCC in this order!" If a line matches, it gets printed.

55. Print only the lines that are 65 characters in length or longer.

`awk 'length > 64'`

This one-liner uses the "length" function. This function is defined as "length([str])" - it returns the length of the string "str". If none is given, it returns the length of the string in variable \$0. For historical reasons, parenthesis () at the end of "length" can be omitted. This one-liner tests if the current line is longer than 64 chars, if it is, the "length > 64" evaluates to true and line gets printed.

56. Print only the lines that are less than 64 characters in length.

`awk 'length < 64'`

This one-liner is almost byte-by-byte equivalent to the previous one. Here it tests if the length if line less than 64 characters. If it is, Awk prints it out. Otherwise nothing gets printed.

57. Print a section of file from regular expression to end of file.

`awk '/regex/,0'`

This one-liner uses a pattern match in form 'pattern1, pattern2' that is called "range pattern". The 3rd Awk Tip from article "10 Awk Tips, Tricks and Pitfalls" explains this match very carefully. It matches all the lines starting with a line that matches "pattern1" and continuing until a line matches "pattern2" (inclusive). In this one-liner "pattern1" is a regular expression "/regex/" and "pattern2" is just 0 (false). So this one-liner prints all lines starting from a line that matches "/regex/" continuing to end-of-file (because 0 is always false, and "pattern2" never matches).

58. Print lines 8 to 12 (inclusive).

`awk 'NR==8,NR==12'`

This one-liner also uses a range pattern in format "pattern1, pattern2". The "pattern1" here is "NR==8" and "pattern2" is "NR==12". The first pattern means "the current line is 8th" and the second pattern means "the current line is 12th". This one-liner prints lines between these two patterns.

59. Print line number 52.

`awk 'NR==52'`

This one-liner tests to see if current line is number 52. If it is, "NR==52" evaluates to true and the line gets implicitly printed out (patterns without statements print the line unmodified).

The correct way, though, is to quit after line 52:

`awk 'NR==52 { print; exit }'`

This one-liner forces Awk to quit after line number 52 is printed. It is the correct way to print line 52 because there is nothing else to be done, so why loop over the whole doing nothing.

60. Print section of a file between two regular expressions (inclusive).

`awk '/Iowa/,/Montana/'`

I explained what a range pattern such as "pattern1,pattern2" does in general in one-liner #57. In this one-liner "pattern1" is "/Iowa/" and "pattern2" is "/Montana/". Both of these patterns are regular expressions. This one-liner prints all the lines starting with a line that matches "Iowa" and ending with a line that matches "Montana" (inclusive).

## 5. Selective Deletion of Certain Lines

There is just one one-liner in this section.

61. Delete all blank lines from a file.

`awk NF`

This one-liner uses the special NF variable that contains number of fields on the line. For empty lines, NF is 0, that evaluates to false, and false statements do not get the line printed.

Another way to do the same is:

`awk '/./'`

This one-liner uses a regular-expression match "." that matches any character. Empty lines do not have any characters, so it does not match.

## Awk one-liners explained e-book

I have written my first e-book called "Awk One-Liners Explained". I improved the explanations of the one-liners in this article series, added new one-liners and added three new chapters - introduction to awk one-liners, summary of awk special variables and idiomatic awk. Please take a look:

## Have Fun!

This concludes the article series about Awk one-liners. I hope that you enjoyed this three-part article and it made you a better Awk programmer!

My future plans are to create a awk1line-explained.txt that will be a supplementary file to the famous awk1line.txt. I am also thinking about publishing a nicely formatted pdf e-book about all the one-liners.

If you liked this article, you may also like a very similar article on Sed One-Liners Explained.

And finally, if you notice anything that you can't understand, please let me know in the comments. Thank you!

Programming December 25, 2008

# Christmas Tree in the Shell

Merry Christmas everyone!

If you're celebrating Christmas at your Unix console, wouldn't it be fun to have a Christmas tree in your shell? It sure would! Follow these steps to have your own Christmas tree in the shell.

## Step 1: Install Acme::POE::Tree Perl Module.

Type the following at your shell, it will install Acme::POE::Tree Perl module:

```perl -MCPAN -e 'install Acme::POE::Tree'
```

If you get notified that you are missing dependencies, answer 'yes' to have them installed.

## Step 2: Celebrate Christmas at Console.

Type this to have your Christmas tree up and running:

```perl -MAcme::POE::Tree -e 'Acme::POE::Tree->new()->run()'
```

The result:

Merry Christmas!

Ps. I created a geeky wish list at Amazon.com. I'd appreciate a Christmas or New Year present. The list is here: Peter's Amazon.com Wish List. Thank you! :)