This article is part of the article series "Awk One-Liners Explained."

I noticed that Eric Wendelin wrote an article "awk is a beautiful tool." In this article he said that it was best to introduce Awk with practical examples. I totally agree with Eric.

When I was learning Awk, I first went through Awk - A Tutorial and Introduction by Bruce Barnett, which was full of examples to try out; then I created an Awk cheat sheet to have the language reference in front of me; and finally I went through the famous Awk one-liners (link to .txt file), which were compiled by Eric Pement.

This is going to be a three-part article in which I will explain every single one-liner in Mr. Pement's compilation. Each part will explain around 20 one-liners. If you follow closely then the explained examples will turn you into a great Awk programmer.

Eric Pement's Awk one-liner collection consists of five sections:

  • File spacing.
  • Numbering and calculations.
  • Text conversion and substitution.
  • Selective printing of certain lines.
  • Selective deletion of certain lines.

The first part of the article will explain the first two sections: "File spacing" and "Numbering and calculations." The second part will explain "Text conversion and substitution", and the last part "Selective printing/deleting of certain lines."

I recommend that you print out my Awk cheat sheet before you proceed. This way you will have the language reference in front of you, and you will memorize things better.

These one-liners work with all versions of awk, such as nawk (AT&T's new awk), gawk (GNU's awk), mawk (Michael Brennan's awk) and oawk (old awk).

Awesome news: I have written an e-book based on this article series. Check it out:

Let's start!

1. File Spacing

1. Double-space a file.

awk '1; { print "" }'

So how does it work? A one-liner is an Awk program and every Awk program consists of a sequence of pattern-action statements "pattern { action statements }". In this case there are two statements "1" and "{ print "" }". In a pattern-action statement either the pattern or the action may be missing. If the pattern is missing, the action is applied to every single line of input. A missing action is equivalent to '{ print }'. Thus, this one-liner translates to:

awk '1 { print } { print "" }'

An action is applied only if the pattern matches, i.e., the pattern is true. Since '1' is always true, this one-liner translates further into two print statements:

awk '{ print } { print "" }'

Every print statement in Awk is silently followed by the ORS (Output Record Separator) variable, which is a newline by default. The first print statement with no arguments is equivalent to "print $0", where $0 is the variable holding the entire line. The second print statement prints nothing, but since each print statement is followed by ORS, it actually prints a newline. So there we have it: each line gets double-spaced.
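Here is a quick way to see it in action; the two-line sample input is made up just for illustration (note that the output ends with a trailing blank line as well):

$ printf 'foo\nbar\n' | awk '1; { print "" }'
foo

bar
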

2. Another way to double-space a file.

awk 'BEGIN { ORS="\n\n" }; 1'

BEGIN is a special kind of pattern which is not tested against the input. It is executed before any input is read. This one-liner double-spaces the file by setting the ORS variable to two newlines. As I mentioned previously, statement "1" gets translated to "{ print }", and every print statement gets terminated with the value of ORS variable.

3. Double-space a file so that no more than one blank line appears between lines of text.

awk 'NF { print $0 "\n" }'

The one-liner uses another special variable called NF - Number of Fields. It contains the number of fields the current line was split into. For example, the line "this is a test" splits into four pieces and NF gets set to 4. The empty line "" does not split into any pieces and NF gets set to 0. Using NF as a pattern effectively filters out empty lines. This one-liner says: "If there are any fields at all, print the whole line followed by a newline."
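For example, on made-up input with a run of blank lines, the blanks collapse to exactly one:

$ printf 'a\n\n\n\nb\n' | awk 'NF { print $0 "\n" }'
a

b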

4. Triple-space a file.

awk '1; { print "\n" }'

This one-liner is very similar to previous ones. '1' gets translated into '{ print }' and the resulting Awk program is:

awk '{ print; print "\n" }'

It prints the line, then prints a newline followed by the terminating ORS, which is a newline by default. The net effect is two blank lines after every line of text.

2. Numbering and Calculations

5. Number lines in each file separately.

awk '{ print FNR "\t" $0 }'

This Awk program prepends the FNR (File Line Number) predefined variable and a tab (\t) to each line. The FNR variable contains the current line number, counted separately for each file. For example, if this one-liner were run on two files, one containing 10 lines and the other 12, it would number the lines in the first file from 1 to 10, then restart the numbering at one for the second file and number its lines from 1 to 12. FNR gets reset from file to file.
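A quick demonstration, using two throwaway sample files invented for the example:

$ printf 'a\nb\n' > file1
$ printf 'c\nd\n' > file2
$ awk '{ print FNR "\t" $0 }' file1 file2
1   a
2   b
1   c
2   d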

6. Number lines for all files together.

awk '{ print NR "\t" $0 }'

This one works the same as #5 except that it uses NR - Line Number variable, which does not get reset from file to file. It counts the input lines seen so far. For example, if it was called on the same two files with 10 and 12 lines, it would number the lines from 1 to 22 (10 + 12).
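Run on the same two sample files as in #5, the numbering now continues across the file boundary:

$ awk '{ print NR "\t" $0 }' file1 file2
1   a
2   b
3   c
4   d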

7. Number lines in a fancy manner.

awk '{ printf("%5d : %s\n", NR, $0) }'

This one-liner uses the printf() function to number lines in a custom format. It takes a format parameter, just like C's printf() function. Note that ORS does not get appended after printf(), so we have to print the newline (\n) character explicitly. This one right-aligns the line number in a five-character-wide field, followed by a space, a colon, another space, and the line itself.
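For example, on two made-up lines of input:

$ printf 'foo\nbar\n' | awk '{ printf("%5d : %s\n", NR, $0) }'
    1 : foo
    2 : bar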

8. Number only non-blank lines in files.

awk 'NF { $0=++a " :" $0 }; { print }'

Awk variables are dynamic; they come into existence when they are first used. This one-liner pre-increments variable 'a' each time the line is non-empty, then prepends the value of this variable to the beginning of the line and prints it out.
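A quick demonstration on made-up input; notice that the blank line is printed but gets no number:

$ printf 'a\n\nb\n' | awk 'NF { $0=++a " :" $0 }; { print }'
1 :a

2 :b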

9. Count lines in files (emulates wc -l).

awk 'END { print NR }'

END is another special kind of pattern which is not tested against the input. It is executed when all the input has been exhausted. This one-liner outputs the value of NR special variable after all the input has been consumed. NR contains total number of lines seen (= number of lines in the file).

10. Print the sum of fields in every line.

awk '{ s = 0; for (i = 1; i <= NF; i++) s = s+$i; print s }'

Awk has some features of the C language, like the for (;;) { ... } loop. This one-liner loops over all fields in a line (there are NF fields in a line) and accumulates their sum in variable 's'. Then it prints the sum out and proceeds to the next line.

11. Print the sum of fields in all lines.

awk '{ for (i = 1; i <= NF; i++) s = s+$i }; END { print s+0 }'

This one-liner is basically the same as #10, except that it prints the sum of the fields of all the lines. Notice that it does not initialize variable 's' to 0. That's not necessary, as variables come into existence dynamically. Also notice that it says "print s+0" and not just "print s". That is necessary when there are no fields at all. In that case "s" never comes into existence and is undefined, and printing an undefined value prints nothing (i.e., just the ORS). Adding 0 forces a numeric operation, and undef+0 = 0, so it prints "0".
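Here is the difference in action. The first invocation sums made-up input; the second runs on empty input (/dev/null), where a bare "print s" would have printed an empty line:

$ printf '1 2 3\n4 5\n' | awk '{ for (i = 1; i <= NF; i++) s = s+$i }; END { print s+0 }'
15
$ awk '{ for (i = 1; i <= NF; i++) s = s+$i }; END { print s+0 }' /dev/null
0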

12. Replace every field by its absolute value.

awk '{ for (i = 1; i <= NF; i++) if ($i < 0) $i = -$i; print }'

This one-liner uses two other features of the C language, namely the if (...) { ... } statement and the omission of curly braces. It loops over all fields in a line and checks if any of the fields is less than 0. If a field is less than 0, it negates the field to make it positive. Fields can be addressed indirectly by a variable. For example, i = 5; $i = "hello" sets field number 5 to the string "hello".

Here is the same one-liner rewritten with curly braces for clarity. The 'print' statement gets executed after all the fields in the line have been replaced by their absolute values.

awk '{
  for (i = 1; i <= NF; i++) {
    if ($i < 0) {
      $i = -$i;
    }
  }
  print
}'
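A quick demonstration on a made-up line of input:

$ echo '-1 2 -3' | awk '{ for (i = 1; i <= NF; i++) if ($i < 0) $i = -$i; print }'
1 2 3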

13. Count the total number of fields (words) in a file.

awk '{ total = total + NF }; END { print total+0 }'

This one-liner matches all the lines and keeps adding up the number of fields in each line. The number of fields seen so far is kept in a variable named 'total'. Once the input has been processed, the special pattern 'END { ... }' is executed, which prints the total number of fields. See one-liner #11 for an explanation of why we "print total+0" in the END block.

14. Print the total number of lines containing word "Beth".

awk '/Beth/ { n++ }; END { print n+0 }'

This one-liner has two pattern-action statements. The first one is '/Beth/ { n++ }'. A pattern between two slashes is a regular expression. It matches all lines containing the pattern "Beth" (not necessarily the word "Beth"; it could just as well be "Bethe" or "theBeth333"). When a line matches, variable 'n' gets incremented by one. The second pattern-action statement is 'END { print n+0 }'. It is executed once all the input has been processed. Note the '+0' in the 'print n+0' statement. It forces '0' to be printed in case there were no matches ('n' was undefined). Had we not put '+0' there, an empty line would have been printed.
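For example, on invented input (note that a line containing "Beth" twice still counts only once, because 'n' counts matching lines, not occurrences):

$ printf 'Beth\nBethe\nBeth and Beth\nnothing\n' | awk '/Beth/ { n++ }; END { print n+0 }'
3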

15. Find the line containing the largest (numeric) first field.

awk '$1 > max { max=$1; maxline=$0 }; END { print max, maxline }'

This one-liner keeps track of the largest number in the first field (in variable 'max') and the corresponding line (in variable 'maxline'). Once it has looped over all lines, it prints them out. Warning: this one-liner does not work if all the values are negative, because the uninitialized 'max' compares numerically as 0 and no negative first field can ever exceed it.

Here is the fix:

awk 'NR == 1 { max = $1; maxline = $0; next; } $1 > max { max=$1; maxline=$0 }; END { print max, maxline }'
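Here is the fixed version run on all-negative sample input (values made up); the original one-liner would print just an empty-looking line here:

$ printf '%s\n' '-5 a' '-2 b' '-9 c' | awk 'NR == 1 { max = $1; maxline = $0; next; } $1 > max { max=$1; maxline=$0 }; END { print max, maxline }'
-2 -2 b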

16. Print the number of fields in each line, followed by the line.

awk '{ print NF ":" $0 } '

This one-liner just prints out the predefined variable NF - Number of Fields, which contains the number of fields in the line, followed by a colon and the line itself.

17. Print the last field of each line.

awk '{ print $NF }'

Fields in Awk need not be referenced by constants. For example, code like 'f = 3; print $f' would print out the 3rd field. This one-liner prints the field whose index is the value of NF. Since NF is the number of fields, $NF is the last field in the line.
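For example, on a made-up line:

$ echo 'one two three' | awk '{ print $NF }'
three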

18. Print the last field of the last line.

awk '{ field = $NF }; END { print field }'

This one-liner keeps track of the last field in variable 'field'. Once it has looped over all the lines, variable 'field' contains the last field of the last line, and it just prints it out.

Here is a better version of the same one-liner. It's more common, idiomatic and efficient:

awk 'END { print $NF }'

19. Print every line with more than 4 fields.

awk 'NF > 4'

This one-liner omits the action statement. As I noted in one-liner #1, a missing action statement is equivalent to '{ print }'.

20. Print every line where the value of the last field is greater than 4.

awk '$NF > 4'

This one-liner combines #17 and #19. It references the last field via the NF variable; if the last field is greater than 4, the missing action prints out the whole line.

Awk one-liners explained e-book

I have written my first e-book called "Awk One-Liners Explained". I improved the explanations of the one-liners in this article series, added new one-liners and added three new chapters - introduction to awk one-liners, summary of awk special variables and idiomatic awk. Please take a look:

Have fun!

That's it for Part I of the article. The second part will be on "Text conversion and substitution."

Have fun learning Awk! It's a fun language to know. :)

This article is part of the article series "MIT Introduction to Algorithms."

This is the sixth post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lectures nine and ten, which are on the topic of Search Trees.

Search tree data structures provide many dynamic-set operations such as search, insert, delete, minimum element, maximum element and others. The complexity of these operations is proportional to the height of the tree. The better we can balance the tree, the faster they will be. Lectures nine and ten discuss two such structures.

Lecture nine discusses randomly built binary search trees. A randomly built binary search tree is a binary tree that arises from inserting the keys in random order into an initially empty tree. The key result shown in this lecture is that the height of this tree is O(lg(n)).

Lecture ten discusses red-black trees. A red-black tree is a binary search tree with an extra bit of information at each node -- its color, which can be either red or black. By constraining the way nodes can be colored on any path from the root to a leaf, red-black trees ensure that the tree is balanced, giving us a guarantee that the operations on this tree will run in O(lg(n)) time!

PS. Sorry for being silent for the past two weeks. I am preparing for job interviews at a company starting with 'G' and it is taking all my time. ;)

Lecture 9: Randomly Built Binary Search Trees

Lecture nine starts with an example of a good and a bad binary search tree. Given a binary tree with n nodes, a good tree has height O(lg(n)), but a bad one has height close to n. As the basic operations on trees run in time proportional to the height of the tree, we want to build the good trees and not the bad ones.

Before discussing randomly built binary search trees, professor Erik Demaine shows another sorting algorithm. It's called binary search tree sort (BST-sort). It's amazingly simple -- given an array of n items to sort, build a BST out of it and do an in-order tree walk on it. An in-order tree walk walks the left branch first, then prints the value at the node, and then walks the right branch. Can you see why the printed list of values is sorted? (If not, see the lecture ;) ) [part three of the article series covers sorting algorithms]
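To make the idea concrete, here is a minimal BST-sort sketch in awk (my own illustration, not code from the lecture). It stores the tree in an array with heap-style indexing -- the left and right children of node n live at 2n and 2n+1 -- and sorts the example array from the lecture:

$ awk '
function insert(n, k) {
  if (!(n in key)) { key[n] = k; return }   # free slot: store the key here
  if (k < key[n]) insert(2*n, k)            # smaller keys go into the left subtree
  else insert(2*n+1, k)                     # the rest go into the right subtree
}
function walk(n) {                          # in-order walk: left, node, right
  if (!(n in key)) return
  walk(2*n); print key[n]; walk(2*n+1)
}
BEGIN {
  cnt = split("3 1 8 2 6 7 5", a, " ")
  for (i = 1; i <= cnt; i++) insert(1, a[i])  # build the BST, rooted at node 1
  walk(1)                                     # print the keys in sorted order
}'
1
2
3
5
6
7
8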

It turns out that there is a relation between BST-sort and the quicksort algorithm: they make the same comparisons, but in a different order. [more info on quicksort in part two of the article series and in the "three beautiful quicksorts" post]

After this discussion, the lecture continues with randomized BST-sort, which leads to the idea of randomly built BSTs.

The other half of the lecture is devoted to a complicated proof of the expected height of a randomly built binary search tree. The result of this proof is that the expected height is order log(n).

You're welcome to watch lecture nine:

Topics covered in lecture nine:

  • [00:50] Good and bad binary search trees (BSTs).
  • [02:00] Binary search tree sort tree algorithm.
  • [03:45] Example of running BST-sort on array (3, 1, 8, 2, 6, 7, 5).
  • [05:45] Running time analysis of BST-sort algorithm.
  • [11:45] BST-sort relation to quicksort algorithm.
  • [16:05] Randomized BST-sort.
  • [19:00] Randomly built binary search trees.
  • [24:58] Theorem: the expected height of a randomly built BST is O(lg(n)).
  • [26:45] Proof outline.
  • [32:45] Definition of convex function.
  • [46:55] Jensen's inequality.
  • [55:55] Expected random BST height analysis.

Lecture nine notes:

Lecture 9, page 1 of 2.

Lecture 9, page 2 of 2.

Lecture 10: Red-Black Trees

Lecture ten begins with a discussion of balanced search trees. A balanced search tree is a search tree data structure that maintains a dynamic set of n elements using a tree of height O(lg(n)).

There are many balanced search tree data structures. For example: AVL trees (invented in 1964), 2-3 trees (invented in 1970), 2-3-4 trees, B-trees, red-black trees, skiplists, treaps.

This lecture focuses exclusively on red-black trees.

Red-black trees are binary search trees with an extra color field for each node. They satisfy the red-black properties:

  • Every node is either red or black.
  • The root and leaves are black.
  • Every red node has a black parent.
  • All simple paths from a node x to a descendant leaf of x have the same number of black nodes = black-height(x).

The lecture gives a proof sketch of the height of an RB-tree, discusses the running time of queries (search, min, max, successor and predecessor operations), and then goes into the details of update operations (insert and delete). Along the way, rotations on a tree are defined -- the right-rotate and left-rotate operations.

The other half of the lecture looks at Red-Black-Insert operation that inserts an element in the tree while maintaining the red-black properties.

Here is the video of lecture ten:

Topics covered in lecture ten:

  • [00:35] Balanced search trees.
  • [02:30] Examples of balanced search tree data structures.
  • [05:16] Red-black trees.
  • [06:11] Red-black properties.
  • [11:26] Example of red-black tree.
  • [17:30] Height of red-black tree.
  • [18:50] Proof sketch of RBtree height.
  • [21:30] Connection of red-black trees to 2-3-4 trees.
  • [32:10] Running time of query operations.
  • [35:37] How to do RB-tree updates (inserts, deletes)?
  • [36:30] Tree rotations.
  • [40:55] Idea of red-black tree insert operation.
  • [44:30] Example of inserting an element in a tree.
  • [54:30] RB-Insert algorithm.
  • [01:03:35] The three cases in insert operation.

Lecture ten notes:

Lecture 10, page 1 of 2.

Lecture 10, page 2 of 2.

Have fun building trees! The next post will be about general methodology of augmenting data structures and it will discuss dynamic order statistics and interval trees!

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms"):

As everyone already knows, Google released a new open-source web browser called Chrome.

Having interest in code reuse, I downloaded the source code and examined all the open-source libraries used.

The Google Chrome browser is an excellent example of code reuse. I found that it uses at least 25 different software libraries!

Here is the full list of libraries, along with relative paths to source code and short library descriptions. Many of the libraries have been patched by googlers; look for README.google files in each library directory for information about changes.

Google Breakpad (/src/breakpad) -- An open-source multi-platform crash reporting system.

Google URL (/src/googleurl) -- A small library for parsing and canonicalizing URLs.

Skia (/src/skia) -- Vector graphics engine.

Google V8 (/src/v8) -- Google's open source JavaScript engine. V8 implements ECMAScript as specified in ECMA-262, 3rd edition, and runs on Windows XP and Vista, Mac OS X 10.5 (Leopard), and Linux systems that use IA-32 or ARM processors. V8 can run standalone, or can be embedded into any C++ application.

WebKit (/src/webkit) -- Open source web browser engine.

Netscape Portable Runtime (NSPR) (/src/base/third_party/nspr) -- Provides a platform-neutral API for system-level and libc-like functions.

Network Security Services (NSS) (/src/base/third_party/nss) -- A set of libraries designed to support cross-platform development of security-enabled client and server applications. Applications built with NSS can support SSL v2 and v3, TLS, PKCS #5, PKCS #7, PKCS #11, PKCS #12, S/MIME, X.509 v3 certificates, and other security standards.

Hunspell (/src/chrome/third_party/hunspell) -- Spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding.

Windows Template Library (/src/chrome/third_party/wtl) -- C++ library for developing Windows applications and UI components. It extends ATL (Active Template Library) and provides a set of classes for controls, dialogs, frame windows, GDI objects, and more.

Google C++ Testing Framework (/src/testing/gtest) -- Google's framework for writing C++ tests on a variety of platforms (Linux, Mac OS X, Windows, Windows CE, and Symbian). Based on the xUnit architecture. Supports automatic test discovery, a rich set of assertions, user-defined assertions, death tests, fatal and non-fatal failures, various options for running the tests, and XML test report generation.

bsdiff and bspatch (/src/third_party/bsdiff and /src/third_party/bspatch) -- Tools for building and applying patches to binary files.

bzip2 (/src/third_party/bzip2) -- Compresses files using the Burrows-Wheeler block-sorting text compression algorithm and Huffman coding.

International Components for Unicode (ICU) (/src/third_party/icu38) -- A mature, widely used set of C/C++ and Java libraries providing Unicode and globalization support for software applications.

libjpeg (/src/third_party/libjpeg) -- Library for handling the JPEG (JFIF) image format.

libpng (/src/third_party/libpng) -- PNG image format library. It supports almost all PNG features, is extensible, and has been extensively tested for over 13 years.

libxml (/src/third_party/libxml) -- XML C parsing library.

libxslt (/src/third_party/libxslt) -- XSLT C library.

LZMA (/src/third_party/lzma_sdk) -- The default and general compression method of the 7z format in the 7-Zip program.

stringencoders (/src/third_party/modp_b64) -- A collection of high-performance C-string transformations (in this case, base64 encoding/decoding), frequently 2x faster than standard implementations (if they exist at all).

Netscape Plugin Application Programming Interface (NPAPI) (/src/third_party/npapi) -- Cross-platform plugin architecture used by many web browsers.

Pthreads-w32 (/src/third_party/pthread) -- Application programming interface (API) for writing multithreaded applications.

SCons (/src/third_party/scons) -- Open source software construction tool, i.e., a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache.

sqlite (/src/third_party/sqlite) -- Software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.

TLS Lite (/src/third_party/tlslite) -- Free Python library that implements SSL 3.0, TLS 1.0, and TLS 1.1. TLS Lite supports non-traditional authentication methods such as SRP, shared keys, and cryptoIDs, in addition to X.509 certificates. (Note: Python is not a part of Chrome. It's used for testing various parts of the browser, such as code coverage and dependencies, measuring page load times, comparing generated HTML, etc.)

zlib (/src/third_party/zlib) -- A free, general-purpose, legally unencumbered (that is, not covered by any patents) lossless data-compression library for use on virtually any computer hardware and operating system.

They have done a really good job making these libraries work together. As someone said, "good coders code, great reuse."

I also found some other exciting things in the source, which I will soon post about. I recommend that you subscribe to my rss feed, if you are interested!

Speaking of Chrome, I am waiting for Google to add the capability to write extensions for their browser! I have already made a list of extensions that I will try to create as soon as they add this feature.

This article is part of the article series "MIT Introduction to Algorithms."

This is the fifth post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lectures seven and eight, which are on the topic of Hashing.

Many applications require a dynamic set that supports dictionary operations insert, search, and delete. For example, a compiler for a computer language maintains a symbol table, in which the keys of elements are arbitrary strings that correspond to identifiers in the language. A hash table is an effective data structure for implementing dictionaries.

Lectures seven and eight cover various implementation techniques of hash tables and hash functions.

Lecture 7: Hashing I

Lecture seven starts with the symbol-table problem -- given a table S, we'd like to insert an element into S, delete an element from S and search for an element in S. We'd also like these operations to take constant time.

The simplest solution to this problem is to use a direct-access (or direct-address) table. To represent the dynamic set, we use an array, or direct-address table, denoted by T, in which each position, or slot, corresponds to a key.

Using a direct-address table, the dictionary operations are trivial to implement.

Direct-Address-Search(T, k)
    return T[k]

Direct-Address-Insert(T, x)
    T[key[x]] = x

Direct-Address-Delete(T, x)
    T[key[x]] = NIL

Direct addressing is applicable when we can afford to allocate an array that has one position for every possible key. It is not applicable when the range of keys can be large (as it requires a lot of space for the array T). This is where hashing comes in.

The lecture continues with an explanation of what hashing is. Hashing uses a hash function h(k) that maps keys k randomly into the slots of hash table T. There is one hitch: two keys may hash to the same slot. We call this situation a collision. Fortunately, there are effective techniques for resolving the conflicts created by collisions.

One of the simplest collision resolution techniques is called chaining. In chaining, we put all the elements that hash to the same slot in a linked list:

Collision resolution by chaining. Each hash-table slot T[j] contains a linked list of all the keys whose hash value is j. For example, h(k1) = h(k4).
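Awk can illustrate the idea in a few lines. Here is a toy simulation of chaining (my own sketch; the table size and keys are made up), with string concatenation standing in for the linked lists:

$ awk 'BEGIN {
  m = 4                                  # tiny hash table with 4 slots
  cnt = split("12 25 9 16 33", k, " ")
  for (i = 1; i <= cnt; i++) {
    h = k[i] % m                         # division-method hash: h(k) = k mod m
    chain[h] = chain[h] " " k[i]         # append the key to its slot's chain
  }
  for (j = 0; j < m; j++) print "slot " j ":" chain[j]
}'
slot 0: 12 16
slot 1: 25 9 33
slot 2:
slot 3: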

Professor Leiserson then analyzes the running time of the insert, delete and search operations. It is concluded that the expected running time of these operations is still O(1), under the assumption that the number of hash-table slots is at least proportional to the number of elements in the table.

The other half of the lecture is devoted to hash functions and to another way of resolving collisions -- open addressing -- along with the probing strategies (how a slot is searched for) used in open addressing: linear probing and double hashing.

A good hash function should distribute the keys uniformly into the slots of the table and the regularity in the key distributions should not affect uniformity.

Two methods of creating hash functions are introduced -- the division method, which defines h(k) = k mod m, and the multiplication method, where h(k) = ((A*k) mod 2^w) >> (w-r), with w being the number of bits in a machine word, A an odd integer between 2^(w-1) and 2^w, and r = lg(m).
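A tiny worked example of both methods (all numbers picked arbitrarily, with a small word size w=8 so the arithmetic stays exact; awk has no shift operator, so the right shift by w-r = 5 is written as integer division by 2^5 = 32):

$ awk 'BEGIN {
  print "division:", 123456 % 701                    # h(k) = k mod m, with m = 701
  A = 171; k = 17                                    # A is odd and lies between 2^7 and 2^8
  print "multiplication:", int((A * k) % 256 / 32)   # ((A*k) mod 2^8) >> (8-3)
}'
division: 80
multiplication: 2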

You're welcome to watch lecture seven:

Topics covered in lecture seven:

  • [00:30] Symbol-table problem.
  • [02:05] Symbol-table operations: insert, delete, search.
  • [04:35] Direct-address table (direct addressing).
  • [09:45] Hashing.
  • [14:30] Resolving hash function collisions by chaining.
  • [17:05] Worst-case analysis of chaining.
  • [19:15] Average-case analysis of chaining.
  • [29:30] Choosing a hash function.
  • [30:55] Division method hash function.
  • [39:05] Multiplication method hash function.
  • [46:30] Multiplication method explained with a modular wheel.
  • [50:12] Resolving hash function collisions by open addressing.
  • [59:00] Linear probing strategy.
  • [01:01:30] Double hashing probing strategy.
  • [01:04:20] Average-case analysis of open addressing.

Lecture seven notes:

Lecture 7, page 1 of 2.

Lecture 7, page 2 of 2.

Lecture 8: Hashing II

Lecture eight starts by addressing a weakness of hashing -- for any fixed choice of hash function, there exists a set of keys that all hash to the same value. This weakness can lead to denial-of-service attacks on applications that use hashing.

The idea of addressing this problem is to choose a hash function at random! This is called universal hashing.

The lecture then moves on to a mathematically rigorous definition of universal hashing and explains one of the many ways to construct a universal hash function.

The other half of the lecture is devoted to perfect hashing. Perfect hashing solves the problem of constructing a static hash table (such as a hash table stored on a CD) so that searches take O(1) time, guaranteed (worst-case). The key idea in building such a hash table is to use 2-level universal hashing, so that no collisions occur at level 2.

Video of lecture eight:

Topics covered in lecture eight:

  • [00:30] Fundamental weakness of hashing.
  • [05:12] Universal hashing.
  • [20:10] Constructing a universal hash function.
  • [49:45] Perfect hashing.
  • [54:30] Example of perfect hashing.
  • [01:06:27] Markov's inequality.
  • [01:14:30] Analysis of storage requirements for perfect hashing.

Lecture eight notes:

Lecture 8, page 1 of 2.

Lecture 8, page 2 of 2.

Have fun hashing! The next post will be about random binary search trees and red-black trees!

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms"):

A friend of mine recently showed me his version of extracting the IP address(es) from ifconfig. The ifconfig tool on Linux is used to list IP addresses and to configure network interfaces. It is sometimes necessary to capture your IP address in a variable, like when writing firewall rules.

I decided to show off and golf the extraction of the IP address with four commonly used tools -- Awk, sed, the version of Perl everyone has, and the latest Perl 5.10.

His original version was pretty ugly:

$ ifconfig | perl -ple 'print $_ if /inet addr/ and $_ =~ s/.*inet addr:((?:\d+\.){3}\d+).*/$1/g  ;$_=""' | grep -v ^\s*$

My own version on my firewall was:

$ ifconfig | grep inet | awk -F: '{ print $2 }' | awk '{ print $1 }'

This looks nicer, but I was uselessly using grep and calling awk twice.

Golfing in Perl

My first attempt was to do it with grep and Perl.

$ ifconfig | grep inet | perl -ple '($_) = /addr:([^ ]+)/'
192.168.1.1
127.0.0.1

In a split second I realized that Perl does grep itself:

$ ifconfig | perl -nle '/addr:([^ ]+)/ and print $1'
192.168.1.1
127.0.0.1

Then I noticed that 'dr:' matches the same lines as 'addr:'; also 'and' can be replaced with '&&':

$ ifconfig | perl -nle '/dr:([^ ]+)/ && print $1'
192.168.1.1
127.0.0.1

The regular expression '([^ ]+)' can be replaced with '(\S+)', which matches non-whitespace characters:

$ ifconfig | perl -nle '/dr:(\S+)/ && print $1'
192.168.1.1
127.0.0.1

Cutting the whitespace, the final version for Perl is 37 characters long:

$ ifconfig|perl -nle'/dr:(\S+)/&&print$1'
192.168.1.1
127.0.0.1

Golfing in Perl 5.10

It can be made shorter with Perl 5.10, which introduces the new say function:

$ ifconfig|perl -nE'/dr:(\S+)/&&say$1'
192.168.1.1
127.0.0.1

The result for Perl 5.10 is 34 characters.

Golfing in sed

Then I also tried doing the same with sed:

$ ifconfig | sed -n '/dr:/{;s/.*dr://;s/ .*//;p;}'
192.168.1.1
127.0.0.1

I modified it to strip off everything that was not numbers and dots:

$ ifconfig | sed -n '/dr:/{;s/.*dr:\([0-9.]\+\) .*/\1/;p;}'
192.168.1.1
127.0.0.1

This turned out to be much longer, so I tried getting rid of the backslashes by enabling extended regular expressions in sed with the -r argument:

$ ifconfig | sed -rn '/dr:/{;s/.*dr:([0-9.]+) .*/\1/;p;}'
192.168.1.1
127.0.0.1

I had forgotten that I had used the ([^ ]+) regex earlier. Another friend of mine reminded me of it, shortening the sed script to:

$ ifconfig | sed -rn 's/.*r:([^ ]+) .*/\1/p'
192.168.1.1
127.0.0.1

Dropping the whitespace, the sed version turned out to be 40 characters long.

Golfing in Awk

My final attempt was to optimize my original Awk version:

$ ifconfig | awk '/dr:/{gsub(/.*:/,"",$2);print$2}'
192.168.1.1
127.0.0.1

Cutting the whitespace, the Awk version is 43 characters.

The Winner

The winner in this golf tournament is Perl 5.10 with 34 chars!

Can You Do Better?

Can you golf this even shorter?

PS. My awk and sed cheat sheets might come in handy!