coding horror keyword analysis

I have subscribed to quite a few programming blogs, one of them being Coding Horror.

Coding Horror is written by a guy named Jeff Atwood (you probably knew that already), and his blog has received massive attention, attracting 93 thousand (woah!) feed subscribers (as of April 2008).

One thing that caught my attention on the Coding Horror blog is that its traffic stats are publicly available!

coding horror traffic statistics

The statistics are hosted by StatCounter.com, which keeps only the last 500 entries of any traffic activity.

recent keyword activity on codinghorror

I wanted to see a clearer picture of the most popular keywords people searched for before ending up on the Coding Horror blog.

Thirty minutes later I had written a Perl program that accessed the statcounter.com statistics, parsed the "Recent Keyword Activity" page, extracted the keywords, and inserted them into an SQLite database.

I always love to describe how my programs work. I'll make it short this time, as we are concentrating on the statistics and not on programming.

The Perl Program

The Perl program uses (or reuses) a few CPAN modules:

  • DBI (with the DBD::SQLite driver) for storing the keywords in the SQLite database,
  • WWW::Mechanize for fetching the StatCounter pages,
  • HTML::TreeBuilder for parsing the HTML, and
  • Date::Parse for converting the extracted dates to Unix timestamps.

The program takes two optional arguments:

  • -nodb - do not insert the keywords in the database (just print them out)
  • number - number of pages to extract keywords from

Here is the source code of the codinghorror_kwstats.pl program:

#!/usr/bin/perl
#
# Peteris Krumins (peter@catonmat.net), 2008
# http://www.catonmat.net  --  good coders code, great reuse
#
# Access codinghorror.com traffic statistics and extract a few pages of latest search queries
# Released under GNU GPL
# 2008.04.08: Version 1.0
#

#
# run it as 'perl codinghorror_kwstats.pl [-nodb] [number of pages to extract]'
# -nodb specifies not to insert keywords in database, just print them to stdout
#

use strict;
use warnings;

use DBI;
use WWW::Mechanize;
use HTML::TreeBuilder;
use Date::Parse;

# URL to publicly available codinghorror's statcounter stats
my $login_url = 'http://my.statcounter.com/project/standard/stats.php?project_id=2600027&guest=1';

# Query used to INSERT a new keyword in the database
my $insert_query = 'INSERT OR IGNORE INTO queries (query, unix_date, human_date) VALUES (?, ?, ?)';

# Path to SQLite database
my $db_path = 'codinghorror.db';

# Insert queries in database or not? Default, yes.
my $do_db = 1;

# Number of pages of keywords to extract. Default 1.
my $pages = 1;

for (@ARGV) {
    $pages = $_ if /^\d+$/;
    $do_db = 0 if /-nodb/;
}

my $dbh;
$dbh = DBI->connect("dbi:SQLite:$db_path", '', '', { RaiseError => 1 }) if $do_db;

my $mech = WWW::Mechanize->new();
my $login_req = $mech->get($login_url);

unless ($mech->success) {
    print STDERR "Failed getting $login_url:\n";
    print $login_req->message, "\n";
    exit 1;
}

unless ($mech->content =~ /Coding Horror/i) {
    # Could not access Coding Horror's stats
    print STDERR "Failed accessing Coding Horror stats\n";
    exit 1;
}

my $kw_req = $mech->follow_link(text => 'Recent Keyword Activity');
unless ($mech->success) {
    print STDERR "Couldn't find 'Recent Keyword Activity' link";
    print $kw_req->message, "\n";
    exit 1;
}

for my $page (1..$pages) {
    my $tree = HTML::TreeBuilder->new_from_content($mech->content);
    my $td_main_panel = $tree->look_down('_tag' => 'td', 'class' => 'mainPanel');
    unless ($td_main_panel) {
        print STDERR "Unable to find '<td class=mainPanel>'";
        exit 1;
    }
    my $table = $td_main_panel->look_down('_tag' => 'table', 'class' => 'standard');
    unless ($table) {
        print STDERR "Unable to find 'table' tag";
        exit 1;
    }
    my @trs = $table->look_down('_tag' => 'tr');
    my $idx = 0;
    for my $tr (@trs) {
        next unless $idx++;
        my @tds = $tr->look_down('_tag' => 'td');
        unless (@tds == 6) {
            print STDERR "<td> count was not 6!\n";
            next;
        }
        my ($date, $time, $query) = map { $_->as_text } (@tds[1..2], $tds[4]);
        next unless $query;
        my $year = (localtime)[5] + 1900;
        my $ydt = "$date $year $time";
        my $unix_date = str2time($ydt);
        print "$date $year $time: $query\n";
        $dbh->do($insert_query, undef, $query, $unix_date, $ydt) if $do_db;
    }
    if ($page != $pages) {
        my $page_req = $mech->follow_link(text => $page + 1);
        unless ($page_req) {
            print STDERR "Couldn't find page ", $page + 1, " of keywords", "\n";
            exit 1;
        }
    }
}

Download: coding horror keyword scraper (downloaded: 2998 times)

Here is an example run of the program:

$ ./codinghorror_kwstats.pl -nodb 2
8 Apr 2008 03:50:54: media player
8 Apr 2008 03:50:53: physical working environment programmers
8 Apr 2008 03:50:26: nano itx case
8 Apr 2008 03:50:23: how to clean some internet spyware or adware infection
8 Apr 2008 03:50:23: mercurial install tutorial windows
8 Apr 2008 03:50:22: iis 5.1 multiple websites
8 Apr 2008 03:50:17: javascript integer manipulation comparision
8 Apr 2008 03:50:16: build machines pc
8 Apr 2008 03:50:14: manage remote desktop connections
8 Apr 2008 03:50:07: check that all variables are initialized
8 Apr 2008 03:50:00: powergrep older version
8 Apr 2008 03:49:43: software counterfeiting
8 Apr 2008 03:48:59: floppy emulator windows xp
8 Apr 2008 03:48:35: safari rendering cleartype
8 Apr 2008 03:48:18: captchas goole broken
8 Apr 2008 03:48:11: vs2005 ide color
8 Apr 2008 03:47:55: optimising dual core for cubase sx3
8 Apr 2008 03:47:44: micosoft project scheduling
8 Apr 2008 03:47:36: dont buy from craig at australian computer resellers
8 Apr 2008 03:47:32: large scale stored procedures
8 Apr 2008 03:47:31: free diff tool
8 Apr 2008 03:46:58: games that support 3 monitors
8 Apr 2008 03:46:56: firefox multiple times same stylesheet
8 Apr 2008 03:46:48: asp.net system.data.sqltypes.sqlnullvalueexception
8 Apr 2008 03:46:37: apple software serial code blocker
8 Apr 2008 03:46:31: beautiful code jon bentley
8 Apr 2008 03:46:28: system.web.httpparseexception
8 Apr 2008 03:46:23: round in c#.net
8 Apr 2008 03:46:15: project postmortem software
8 Apr 2008 03:45:43: programming fun
8 Apr 2008 03:45:33: sending messages over ip using command prompt
8 Apr 2008 03:45:26: where did horror develop?

The SQLite Database

The database has just one table called 'queries', which contains the 'query', 'unix_date' and 'human_date' columns. The 'unix_date' column is used for sorting the entries chronologically, and 'human_date' is there just so I could easily see the date.

Here is the schema of the database:

CREATE TABLE queries (id INTEGER PRIMARY KEY, query TEXT, unix_date INTEGER, human_date TEXT);
CREATE UNIQUE INDEX unique_query_date ON queries (query, unix_date);

As the Perl program is run periodically, it might extract the same keywords several times. I created a UNIQUE index on the 'query' and 'unix_date' fields and left the job of dropping the duplicate records to SQLite.

The Perl program uses the following SQL query to insert the data into the database:

INSERT OR IGNORE INTO queries (query, unix_date, human_date) VALUES (?, ?, ?)

The 'OR IGNORE' makes sure the duplicate records get silently discarded.

Simple Statistics

I have been collecting keywords since March 31, and the database has now grown to 73,336 records and 7 MB in size (3 MB compressed).

Download: coding horror keyword database (.zip) (downloaded: 541 times)

I ran a few simple SQL queries against the data using the GUI SQLite Database Browser to find the most popular keywords. I recommend downloading it if you want to play around with the database.

The first query selected the 15 most popular keywords, along with their counts and their percentage of all keywords.

The following SQL query did it:

SELECT
 count(query) c,
 (round(count(query)/(1.0*(select count(*) from queries)),3)*100) || '%',
 query
FROM queries
GROUP BY query
ORDER BY c DESC
LIMIT 15

most popular coding horror’s keywords (sql query in sqlite database browser)

I also made a bar chart using the public Google Charts API:

This chart would look much better if it had vertical bars. I couldn't figure out how to add keywords nicely below each bar, though.

Here is what the messy query to the Google Charts API looks like:

http://chart.apis.google.com/chart?chtt=Coding%20Horror's%20Top%2015%20Keywords&cht=bhs&chd=t:100,77,12.07,10.18,9.09,8.74,8.64,8.49,7.05,6.51,5.91,5.71,5.66,5.61,5.22&chs=400x450&chxt=x,y&chxl=0:|0|2013|1:|command%20prompt%20commands|registration%20keys|cmd%20tricks|vista%20media%20center|sql%20joins|command%20prompt|you%20may%20be%20a%20victim...|codinghorror|dual%20core%20vs%20quad%20core|quad%20core%20vs%20dual%20core|cmd%20commands|command%20prompt%20tricks|system%20idle%20processes|coding%20horror|system%20idea%20process

Just to illustrate various ways of working with an SQLite database, I ran the same kind of query from the command line, this time for the top 50 most popular keywords:

$ sqlite3 ./codinghorror.db
sqlite> .header ON
sqlite> .explain ON
sqlite> SELECT count(query) c, query FROM queries GROUP BY query ORDER BY c DESC LIMIT 50;
c     query
----  -------------
2013  system idle process
1550  coding horror
243   system idle processes
205   command prompt tricks
183   cmd commands
176   quad core vs dual core
174   dual core vs quad core
171   codinghorror
142   you may be a victim of software counterfeiting
131   command prompt
119   sql joins
115   vista media center
114   cmd tricks
113   registration keys
105   command prompt commands
105   jeff atwood
99    quad core
96    dell xps m1330 review
89    rainbow tables
84    what is system idle process
82    software counterfeiting
80    fizzbuzz
78    laptop power consumption
77    quad core vs duo core
75    sql join
74    dell xps m1330
74    hard drive temperature
74    vista memory usage
73    source control
70    linked in
69    pontiac aztec
66    pontiac aztek
64    m1330 review
63    cracking
61    consolas
60    captcha
56    hyperterminal
56    ikea jerker
55    code horror
55    polling rate
55    source safe
54    coding horrors
54    dual core or quad core
54    programming quotes
54    visual source safe
53    logparser
51    sourcesafe
51    superfetch
51    three monitors
50    windows experience index

Knowing the most popular keywords can give you hints about what topics to write about on your blog. For example, an article named 'Windows Command Prompt Tricks' would instantly start bringing good traffic from search engines!

I did another bunch of queries to find the most popular programming languages on Coding Horror. I put the languages I could think of in a langs.txt file and ran the following Perl one-liner:

$ perl -MDBI -wlne 'BEGIN { $, = q/ /; $dbh = DBI->connect(q/dbi:SQLite:codinghorror.db/); } print +($dbh->selectrow_array(qq/SELECT count(query) FROM queries WHERE query LIKE "$_" OR query LIKE "$_ %" OR query LIKE "% $_" OR query LIKE "% $_ %"/))[0], $_' langs.txt | sort -n -r

It produced the following output:

1127 visual studio
1087 c#
407 c
287 javascript
239 java
139 asp
104 visual basic
59 php
44 ruby
42 python
26 perl
22 lisp
19 erlang
3 pascal
1 tcl
1 prolog
0 ml
0 haskell

I added 'visual studio' to the list of programming languages, as every beginner thinks it actually is a programming language. There were no keywords matching 'C++' because most search engines treat '+' as an operator rather than as part of the search string.

I must say that Python is the answer to life, the universe and everything, as it was searched for 42 times! :)

Here is the same data put on a chart:

Here are some of the most popular search queries among programming languages:

I suggest you download the keyword database and analyze whatever data interests you the most!

Downloads

Download Perl program: coding horror keyword scraper
Downloaded: 2998 times

Download SQLite database(3 MB): coding horror keyword database (.zip)
Downloaded: 541 times

If you liked the post, why not vote for it?

This article is part of the article series "Musical Geek Friday."

leech access - coming at you (leech axss - comin at choo)

Continuing my Friday geek music series, I am presenting to you a very geeky hip-hop song about downloading pirated stuff, such as music, software and movies (so-called "warez"), off the net.

The song was originally written by guys calling themselves Leech Axss, and it's called "Leech Axss - Coming@Choo".

This song is NSFW (not safe for work), as it contains explicit language! You can always listen to it on your headphones, though. :)

As I mentioned in my first geek music post, I won't just post the song, but will also provide a little insight into it.

This song is about a lamer trying to gain leech access to some guy's warez ftp server. Access to a site with hundreds of gigabytes of warez, with no intention of uploading any new content, is usually called "leech access". It's every beginner's dream to have leech access to a server. Unfortunately, if you are not already well respected, you can't just get it. To gain access, you must provide some value to the site; for example, you must upload some 0day stuff. Digital content is called 0day if it gets distributed on warez servers before it is actually released by the company.

The lamer in this song is told to use his real email address as the password for the ftp server (note that anonymous ftp access usually asks for an email as the password). Being totally lame, he provides his real email, gets sent trojans and viruses, gets mail bombed, and finally his machine gets owned.

[audio:http://www.catonmat.net/download/leech-axss_coming-at-choo.mp3]

Download this song: leech axss - coming at you.mp3 (musical geek friday #2)
Downloaded: 14361 times

Download lyrics (not censored): leech axss - coming at you lyrics (musical geek friday #2)
Downloaded: 3904 times

Here are the lyrics (I censored the explicit language; see the 'download lyrics' link above for the uncensored version):

where is my snare?
i have no snare in my headphones
oh, there's my snare
in my audio warez folder, ho ho ho ho ho

leech axss, leech axss, leech, leech axss

freebsd is da s**t to me
linux, stick it up in your a**, you get me
you came to f**k with me in the irc
that i didn't give you access to my ftp
little dood, with a f**kin' +v in your nick
you might as well be sucking my motherf**kin' d**k
message of the day says that you are lame
so prevent the pain and get a dc j
leech axss, ain't no dude to f**k with
leech axss, ain't no dude to chat with
'cause i'm downloading chicks-with-d**s.avi
and i'm loadin' edonkey my windows swap file
yo yo yo, where's your 0day
you ain't got no 0day, because you're gay
because you are afraid and so easy to break
make it easy to take over you pc and f**k it up straight

refrain:
leech axss is comin' at you, your box is mine in minute or two
your firewalls are tumbling down, leeching all the 0day that is found
dvs in you mp3s, you gotta fear my leet-o skillz
comin' inside the megabytes, leech axss you just can't fight

leech axss, leech axss, leech, leech axss ho ho ho ho ho

just put your e-mail in the password-box
now i've got your info, b***h, thanks a lot
i'ma send you a motherf**king e-mail bomb
dos your isp, dada dam da dam
trojan horses and viruses are coming at you
gold-sex is the site where you gonna re-route
meanwhile i hax and gonna gain the root
"f**k you" is message before you reboot
whoops, did i open your cd-drive?
whoops, did i f**king read your mind?
two thousand messages in your icq
and your soundcard just lost the irq
these are the wicked ways of leech axss
i am leet - you're nothing but your daddy's ball sweat
check me in my channel, as me operate
and get more net sex than my n****r, bill gates

refrain:
leech axss is comin' at you, your box is mine in minute or two
your firewalls are tumbling down, leeching all the 0day that is found
dvs in you mp3s, you gotta fear my leet-o skillz
comin' inside the megabytes, leech axss you just cant fight

leech axss is comin' at you, your box is mine in minute or two
your firewalls are tumbling down, leeching all the 0day that is found
dvs in you mp3s, you gotta fear my leet-o skillz
comin' inside the megabytes, leech axss you just can't fight

ctrl + alt + del

Download Leech Access is Coming at You Song

Download this song: leech axss - coming at you.mp3 (musical geek friday #2)
Downloaded: 14361 times

Download lyrics (not censored): leech axss - coming at you lyrics (musical geek friday #2)
Downloaded: 3904 times

Click to listen:
[audio:http://www.catonmat.net/download/leech-axss_coming-at-choo.mp3]

Have fun and until next geeky Friday! :)

guy l. steele jr. growing a language java acm talk

I found a really exciting video lecture by Guy L. Steele that I'd like to share with you. The title of the lecture is "Growing a Language".

The main question Guy Steele asks during the lecture is: "If I want to help other persons to write all sorts of programs, should I design a small programming language or a large one?" His answer is that he should build neither a small nor a big language; he needs to design a language that can grow. The main goal in designing a language should be to plan for growth. The language must start small, and it must grow as the set of users grows.

As an example, he compares APL and Lisp. APL did not allow its users to grow the language in a "smooth" way: new primitives added by users did not look the same as built-in primitives, which made the language hard for users to grow. In Lisp, on the other hand, new words defined by the user look like language primitives, and language primitives look like user-defined words. This let users easily extend the language, share their code, and grow the language.

Mr. Steele also prepared a PDF of his talk. Download it here (mirror, just in case: here).

He currently works at Sun Microsystems, where he is responsible for research in language design and implementation strategies. His bio page on the Sun Microsystems website says: "He has been praised for an especially clear and thorough writing style in explaining the details of programming languages." This lecture really shows it.

I understood what he was up to from the very beginning of the lecture. Only after the first ten minutes did Guy reveal his firm rule for the talk: if he needs to use a word of two or more syllables, he must first define it.

Another thing Guy Steele shows with this talk is how a small language restricts the expressiveness of your thoughts: you must first define a lot of new words before you can express yourself clearly and quickly.

Should a programming language be small or large? A small programming language might take but a short time to learn. A large programming language may take a long, long time to learn, but then it is less hard to use, for we then have a lot of words at hand — or, I should say, at the tips of our tongues — to use at the drop of a hat. If we start with a small language, then in most cases we can not say much at the start. We must first define more words; then we can speak of the main thing that is on our mind. [...] If you want to get far at all with a small language, you must first add to the small language to make a language that is more large.

He makes many more interesting points about how languages should be grown. Just watch the lecture!

He defined the following words during the lecture: woman, person, machine, other, other than, number, many, computer, vocabulary, language, define, program, definition, example, syllable, primitive, because, design, twenty, thirty, forty, hundred, million, eleven, thirteen, fourteen, sixteen, seven, fifty, ago, library, linux, operating system, cathedral, bazaar, pattern, datum, data, object, method, generic type, operator, overloaded, polymorphic, complex number, rational number, interval, vector, matrix, meta.

This article is part of the article series "Musical Geek Friday."

musical geek friday - crypto

I looked through my mp3 collection and found some really geeky and funny songs. I thought, why not add a little more fun to my blog? The songs are really geeky and my readers might like them.

I'll be posting a song each Friday until I run out of songs (I have 6 songs at the moment, and maybe I'll find some more). I'll also try to add some comments on what each song is about, in case someone is not geeky enough to understand it. :)

I urge you to subscribe to my feed to get notified about all the upcoming geek music automatically! :)

The first song I present to you, as the title suggests, is about cryptography, particularly about a cracker trying to defeat 56-bit and 128-bit symmetric key algorithms, such as DES or RC4, used in secure online communications. Note that 56-bit symmetric keys are no longer recommended, as they can be broken in no time with specialized hardware and modern computers.

The song's title is "Crypto", and as I found out, it is a parody of the Banana Boat Song by Harry Belafonte.

[audio:http://www.catonmat.net/download/crypt-o.mp3]

Download this song: crypt-o.mp3 (musical geek friday #1)
Downloaded: 73617 times

Download lyrics: crypt-o lyrics (musical geek friday #1)
Downloaded: 3372 times

Unfortunately, I could not find the author of this parody. If you know who it is, please tell me in the comments so I can give proper credit.

Here are the lyrics:

Crypto
It's in Crypto
Crypto come and the crook go home

Safe?
Is it safe?
It's all safe
Pretty safe
It's okay
It's okay-o
Cracker come but he won't break code

World get small when the internet come
Online come and we shop from home
Download the software 'till the morning's done
Daylight come and me hard drive full
Punch in my credit card order me a dancer
Dancer come and me g-string go
Come Mr. Businessman join in the bonanza
Shopper come and the overhead's low

56 bit key is a great big bunch
Crypto come and the crook go home
128 bit even harder to crunch
Cracker try but he just grow old

Crypto,
It's in Crypto
Crypto come and the cash can flow

Safe?
Is it safe?
It's all safe
Pretty safe
It's okay
It's okay-o

Crypto come and the crook get boned

Download Crypto Song

Download this song: crypt-o.mp3 (musical geek friday #1)
Downloaded: 73617 times

Download lyrics: crypt-o lyrics (musical geek friday #1)
Downloaded: 3372 times

Click to listen:
[audio:http://www.catonmat.net/download/crypt-o.mp3]

reddit river: what flows online!

On my way back home from university (30+ minutes), I just love to read news on my favorite social news site, reddit.com. A few weeks ago I saw this 'Ask Reddit' post, which asked if we could get a reddit version for mobile phones. Well, I thought, it's a cool project and I can do it quickly.

While scanning through the comments on the 'Ask Reddit' post, I noticed davidlvann's comment, where he said that Digg.com already had an almost plain-text version of Digg, called DiggRiver.com.

It didn't take me long to do a

$ whois redditriver.com
No match for "REDDITRIVER.COM".

to find that the domain RedditRiver.com was not registered! What a great name for a project! I quickly mailed my friend Alexis [kn0thing] Ohanian at Reddit (check out his alien blog) to ask for permission to do the Reddit River project. Sure enough, he registered the domain for me and I was free to make it happen!

I'll describe how I made the site, and I will release the full source code.

Update: The project is now live!

Update: Full source code is now available! It includes all the scripts mentioned here!

Download full redditriver.com source code (downloaded 6905 times)

My language of choice for this project is Python, the same language reddit.com is written in.

This is actually the first real project I have done in Python (I'm a big Perl fan). I have a good overall understanding of Python, but I had never done a project from the ground up! Before starting, I watched a few Python video lectures and read a bunch of articles to get into the mindset of a Pythonista.

Design Stages of RedditRiver.com

The main goal of the project was to create a very lightweight version of reddit, which would monitor story changes (as they get voted up or down) across several pages of the most popular subreddits, and which would find mobile versions of the posted stories (that is, rewrite their URLs: for example, a link to The Washington Post gets rewritten to the print version of the same article, and a link to youtube.com gets rewritten to its mobile version, m.youtube.com, etc.).

The project was done in several separate steps.

  • First, I set up the web server to handle Python applications,
  • Then I created a few Python modules to extract content from the Reddit website,
  • Next, I created an SQLite database and wrote a few scripts to save the extracted data,
  • Then I wrote a Python module to discover mobile versions of given web pages,
  • Finally, I created the web.py application to handle requests to RedditRiver.com!

Setting up the Web Server

I am very lucky to have a fully dedicated server sponsored by ZigZap - We Are Tech (I seriously recommend them if you are looking for great hosting!). Being an experienced Linux user, I asked them for a pure Linux server with no software or control panels pre-installed, and that's exactly what I got! Thanks, ZigZap! :)

I already run this blog and picurls.com on the server, and I had chosen the lighttpd web server and the PHP programming language for those two projects. To get RedditRiver running, I had to add Python support to the web server.

I decided to use the web.py web framework to serve the HTML content because of its simplicity, and because the Reddit guys used it themselves after rewriting Reddit from Lisp to Python.

Following the install instructions, getting web.py running on the server was as simple as installing the web.py package!

It was also just as easy to get the lighttpd web server to communicate with web.py and my application. This required installing the flup package, which allows lighttpd to interface with web.py.
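
Just to give an idea of what a web.py application looks like, here is a minimal sketch (this is not redditriver.com code, just an illustration of the rough shape of a web.py 0.2-style application; exact API details may differ between web.py versions):

import web

# map URL patterns to handler class names
urls = (
    '/(.*)', 'hello'
)

class hello:
    def GET(self, name):
        if not name:
            name = 'world'
        print 'Hello, %s!' % name    # 0.2-style handlers write their output with print

if __name__ == '__main__':
    web.run(urls, globals())         # standalone server; under lighttpd it runs through flup

The real redditriver.com application has the same shape, just with more URL patterns and with handlers that query the SQLite database and render Cheetah templates.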

Update: after setting it all up and experimenting a bit with web.py (version 0.23) and Cheetah templates, I found that for some mysterious reason web.py did not handle the "#include" statements in the templates. The problem was in web.py's 'cheetah.py' file, line 23, where it compiles the regular expression for handling "#include" statements:

r_include = re_compile(r'(?!\\)#include \"(.*?)\"($|#)', re.M)

When I tested it out in the interpreter,

>>> r_include = re.compile(r'(?!\\)#include \"(.*?)\"($|#)', re.M)
>>> r_include.search('#include "foo"').groups()
('foo', '')
>>> r_include.search('foo\n#include "bar.html"\nbaz').groups()
('bar.html', '')

it found #include's across multiline text just fine, but it did not work with my template files. I tested it like 5 times and just couldn't figure out why it was not working.

As RedditRiver is the only web.py application running on my server, I simply patched the regex on line 23 to something trivial, and it all started working! I dropped all the negative lookahead magic and the end-of-line check:

r_include = re_compile(r'#include "(.*?)"', re.M)

As I said, I am not sure why the original regex did not work in the web.py application, but did work in the interpreter. If anyone knows what happened, I will be glad to hear from you! :)

Accessing Reddit Website via Python

I wrote several Python modules (which also work as executables) to access information on Reddit: stories across multiple pages of various subreddits (and the front page), and the list of user-created subreddits.

As Reddit still does not provide an API to access the information on their site, I had to extract the relevant information from the HTML content of the pages.

The first module I wrote is called 'subreddits.py'. It accesses http://reddit.com/reddits and returns (or prints out, if used as an executable) the list of the most popular subreddits (a subreddit is a reddit for a specific topic, for example, programming or politics).

Get this program here: subreddit extractor (redditriver.com project) (downloaded: 4827 times).

This module provides three useful functions:

  • get_subreddits(pages=1, new=False), which gets 'pages' pages of subreddits and returns a list of dictionaries of them. If new is True, it gets 'pages' pages of new subreddits (http://reddit.com/reddits/new),
  • print_subreddits_paragraph(), which prints the subreddit information in a human-readable format, and
  • print_subreddits_json(), which prints it in JSON format. The output is in utf-8 encoding.

The way this module works can be seen from the Python interpreter right away:

>>> import subreddits
>>> srs = subreddits.get_subreddits(pages=2)
>>> len(srs)
50
>>> srs[:5]
[{'position': 1, 'description': '', 'name': 'reddit.com', 'subscribers': 11031, 'reddit_name': 'reddit.com'}, {'position': 2, 'description': '', 'name': 'politics', 'subscribers': 5667, 'reddit_name': 'politics'}, {'position': 3, 'description': '', 'name': 'programming', 'subscribers': 9386, 'reddit_name': 'programming'}, {'position': 4, 'description': 'Yeah reddit, you finally got it. Context appreciated.', 'name': 'Pictures and Images', 'subscribers': 4198, 'reddit_name': 'pics'}, {'position': 5, 'description': '', 'name': 'obama', 'subscribers': 651, 'reddit_name': 'obama'}]
>>>
>>> from pprint import pprint
>>> pprint(srs[3:5])
[{'description': 'Yeah reddit, you finally got it. Context appreciated.',
  'name': 'Pictures and Images',
  'reddit_name': 'pics',
  'subscribers': 4198},
 {'description': '',
  'name': 'obama',
  'reddit_name': 'obama',
  'subscribers': 651}]
>>>
>>> subreddits.print_subreddits_paragraph(srs[3:5])
position: 4
name: Pictures and Images
reddit_name: pics
description: Yeah reddit, you finally got it. Context appreciated.
subscribers: 4198

position: 5
name: obama
reddit_name: obama
description:
subscribers: 651
>>>
>>> subreddits.print_subreddits_json(srs[3:5])
[
    {
        "position": 4,
        "description": "Yeah reddit, you finally got it. Context appreciated.",
        "name": "Pictures and Images",
        "subscribers": 4198,
        "reddit_name": "pics"
    },
    {
        "position": 4,
        "description": "",
        "name": "obama",
        "subscribers": 651,
        "reddit_name": "obama"
    }
]

Or it can be called from the command line:

$ ./subreddits.py --help
usage: subreddits.py [options]

options:
  -h, --help  show this help message and exit
  -oOUTPUT    Output format: paragraph or json. Default: paragraph.
  -pPAGES     How many pages of subreddits to output. Default: 1.
  -n          Retrieve new subreddits. Default: nope.

This module reuses the awesome BeautifulSoup HTML parser module and the simplejson JSON encoding module.

The second program I wrote is called 'redditstories.py'. It accesses the specified subreddit and gets the latest stories from it. It was written pretty much the same way as I did it for the redditmedia project in Perl.

Get this program here: reddit stories extractor (redditriver.com project) (downloaded: 3093 times).

This module also provides three similar functions:

  • get_stories(subreddit='front_page', pages=1, new=False), which gets 'pages' pages of stories from the given subreddit and returns a list of dictionaries of them. If new is True, it gets new stories only,
  • print_stories_paragraph(), which prints the story information in a human-readable format, and
  • print_stories_json(), which prints it in JSON format. The output is in utf-8 encoding.

It can also be used as a Python module or executable.

Here is an example of using it as a module:

>>> import redditstories
>>> s = redditstories.get_stories(subreddit='programming')
>>> len(s)
25
>>> s[2:4]
[{'title': 'when customers don\'t pay attention and reply to a "donotreply.com" email address, it goes to Chet Faliszek, a programmer in Seattle', 'url': 'http://consumerist.com/371600/the-man-who-owns-donotreplycom-knows-all-the-secrets-of-the-world', 'unix_time': 1206408743, 'comments': 54, 'subreddit': 'programming', 'score': 210, 'user': 'srmjjg', 'position': 3, 'human_time': 'Tue Mar 25 03:32:23 2008', 'id': '6d8xl'}, {'title': 'mysql --i-am-a-dummy', 'url': 'http://dev.mysql.com/doc/refman/4.1/en/mysql-tips.html#safe-updates', 'unix_time': 1206419543, 'comments': 59, 'subreddit': 'programming', 'score': 135, 'user': 'enobrev', 'position': 4, 'human_time': 'Tue Mar 25 06:32:23 2008', 'id': '6d9d3'}]
>>> from pprint import pprint
>>> pprint(s[2:4])
[{'comments': 54,
  'human_time': 'Tue Mar 25 03:32:23 2008',
  'id': '6d8xl',
  'position': 3,
  'score': 210,
  'subreddit': 'programming',
  'title': 'when customers don\'t pay attention and reply to a "donotreply.com" email address, it goes to Chet Faliszek, a programmer in Seattle',
  'unix_time': 1206408743,
  'url': 'http://consumerist.com/371600/the-man-who-owns-donotreplycom-knows-all-the-secrets-of-the-world',
  'user': 'srmjjg'},
 {'comments': 59,
  'human_time': 'Tue Mar 25 06:32:23 2008',
  'id': '6d9d3',
  'position': 4,
  'score': 135,
  'subreddit': 'programming',
  'title': 'mysql --i-am-a-dummy',
  'unix_time': 1206419543,
  'url': 'http://dev.mysql.com/doc/refman/4.1/en/mysql-tips.html#safe-updates',
  'user': 'enobrev'}]
>>> redditstories.print_stories_paragraph(s[:1])
position: 1
subreddit: programming
id: 6daps
title: Sign Up Forms Must Die
url: http://www.alistapart.com/articles/signupforms
score: 70
comments: 43
user: markokocic
unix_time: 1206451943
human_time: Tue Mar 25 15:32:23 2008

>>> redditstories.print_stories_json(s[:1])
[
    {
        "title": "Sign Up Forms Must Die",
        "url": "http:\/\/www.alistapart.com\/articles\/signupforms",
        "unix_time": 1206451943,
        "comments": 43,
        "subreddit": "programming",
        "score": 70,
        "user": "markokocic",
        "position": 1,
        "human_time": "Tue Mar 25 15:32:23 2008",
        "id": "6daps"
    }
]

Using it from the command line:

$ ./redditstories.py --help
usage: redditstories.py [options]

options:
  -h, --help   show this help message and exit
  -oOUTPUT     Output format: paragraph or json. Default: paragraph.
  -pPAGES      How many pages of stories to output. Default: 1.
  -sSUBREDDIT  Subreddit to retrieve stories from. Default:
               reddit.com.
  -n           Retrieve new stories. Default: nope.

These two programs just beg to be converted into a single Python module. They share the same logic with just a few changes in the parser. But for the moment I am generally happy with them, and they do the job well. They can also be understood individually, without needing to inspect several source files.

I think one of my future posts could be about a reddit information access library in Python.

I can already think of a hundred things someone could do with such a library. For example, one could print the top programming stories in his or her shell:

$ echo "Top five programming stories:" && echo && ./redditstories.py -s programming | grep 'title' | head -5 && echo && echo "Visit http://reddit.com/r/programming to view them!"

Top five programming stories:

title: Sign Up Forms Must Die
title: You can pry XP from my cold dead hands!
title: mysql --i-am-a-dummy
title: when customers don't pay attention and reply to a "donotreply.com" email address, it goes to Chet Faliszek, a programmer in Seattle
title: Another canvas 3D Renderer written in Javascript

Visit http://reddit.com/r/programming to view them!

Creating and Populating the SQLite Database

The database choice for this project is SQLite, as it is fast and light, and this project is so simple that I can't think of any reason to use a more complicated database system.

The database has a trivial structure with just two tables, 'subreddits' and 'stories'.

CREATE TABLE subreddits (
  id           INTEGER  PRIMARY KEY  AUTOINCREMENT,
  reddit_name  TEXT     NOT NULL     UNIQUE,
  name         TEXT     NOT NULL     UNIQUE,
  description  TEXT,
  subscribers  INTEGER  NOT NULL,
  position     INTEGER  NOT NULL,
  active       BOOL     NOT NULL     DEFAULT 1
);

INSERT INTO subreddits (id, reddit_name, name, description, subscribers, position) VALUES (0, 'front_page', 'reddit.com front page', 'since subreddit named reddit.com has different content than the reddit.com frontpage, we need this', 0, 0);

CREATE TABLE stories (
  id            INTEGER    PRIMARY KEY  AUTOINCREMENT,
  title         TEXT       NOT NULL,
  url           TEXT       NOT NULL,
  url_mobile    TEXT,
  reddit_id     TEXT       NOT NULL,
  subreddit_id  INTEGER    NOT NULL,
  score         INTEGER    NOT NULL,
  comments      INTEGER    NOT NULL,
  user          TEXT       NOT NULL,
  position      INTEGER    NOT NULL,
  date_reddit   UNIX_DATE  NOT NULL,
  date_added    UNIX_DATE  NOT NULL
);

CREATE UNIQUE INDEX idx_unique_stories ON stories (title, url, subreddit_id);

The 'subreddits' table contains information extracted by the 'subreddits.py' module (described earlier). It keeps the information and positions of all the subreddits that appear on the most popular subreddits page (http://reddit.com/reddits).

Reddit lists 'reddit.com' as a separate subreddit on the most popular subreddits page, but it turned out that it is not the same as the front page of reddit! That's why, right after creating the table, I insert a fake subreddit called 'front_page' into it, to keep track of both the 'reddit.com' subreddit and reddit's front page.

The information in the table is updated by a new program - update_subreddits.py.

View: subreddit table updater (redditriver.com project) (downloaded: 2220 times)

The other table, 'stories', contains information extracted by the 'redditstories.py' module (also described earlier).

The information in this table is updated by another new program - update_stories.py.

As it is impossible to keep track of all the score, comment, and position changes across all the subreddits, the program monitors just a few pages of each of the most popular subreddits.

View: story table updater (redditriver.com project) (downloaded: 2169 times)
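
To give an idea of what the story updater does, here is a simplified sketch (this is not the actual update_stories.py; the INSERT statement, the page count and the database file name are my own illustration). It pulls a few pages of stories with redditstories.get_stories() and writes them to the 'stories' table, letting the unique index drop the duplicates:

# A simplified sketch of the story updating logic, not the actual update_stories.py.
import time
import sqlite3

import redditstories

conn = sqlite3.connect('redditriver.db')    # database file name is an assumption

def update_subreddit(reddit_name, subreddit_id, pages=2):
    stories = redditstories.get_stories(subreddit=reddit_name, pages=pages)
    for s in stories:
        # idx_unique_stories (title, url, subreddit_id) silently drops stories we already have
        conn.execute(
            "INSERT OR IGNORE INTO stories (title, url, reddit_id, subreddit_id, score, "
            "comments, user, position, date_reddit, date_added) "
            "VALUES (?,?,?,?,?,?,?,?,?,?)",
            (s['title'], s['url'], s['id'], subreddit_id, s['score'], s['comments'],
             s['user'], s['position'], s['unix_time'], int(time.time())))
    conn.commit()

The real program also has to update the scores, comment counts and positions of stories it has already seen, and to discover their mobile URLs (see the next section); I left that out to keep the sketch short.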

These two programs are run periodically by cron (the unix task scheduler): update_subreddits.py runs every 30 minutes, and update_stories.py runs every 5 minutes.

Finding the Mobile Versions of Given Websites

This is probably the most interesting piece of software that I wrote for this project. The idea is to find versions of a website suitable for viewing on a mobile device.

For example, most of the stories on the politics subreddit link to the largest online newspapers and news agencies, such as The Washington Post or MSNBC. These websites provide a 'print' version of the page, which is ideally suited for mobile devices.

Another example is websites that have designed a real mobile version of their pages and let the user agent know about it by placing a <link rel="alternate" media="handheld" href="..."> tag in the head section of the HTML document.

I wrote an autodiscovery Python module called 'autodiscovery.py'. This module is used by the update_stories.py program described in the previous section. After getting the list of new reddit stories, update_stories.py tries to autodiscover a mobile version of each story, and if it succeeds, it places the mobile URL in the 'url_mobile' column of the 'stories' table.

Here is an example run of the module from the Python interpreter:

>>> from autodiscovery import AutoDiscovery
>>> ad = AutoDiscovery()
>>> ad.autodiscover('http://www.washingtonpost.com/wp-dyn/content/article/2008/03/24/AR2008032402969.html')
'http://www.washingtonpost.com/wp-dyn/content/article/2008/03/24/AR2008032402969_pf.html'
>>> ad.autodiscover('http://www.msnbc.msn.com/id/11880954/')
'http://www.msnbc.msn.com/id/11880954/print/1/displaymode/1098/'

And it can also be used from the command line:

$ ./autodiscovery.py http://www.washingtonpost.com/wp-dyn/content/article/2008/03/24/AR2008032402969.html
http://www.washingtonpost.com/wp-dyn/content/article/2008/03/24/AR2008032402969_pf.html

Source: mobile webpage version autodiscovery (redditriver.com project) (downloaded 3720 times)

This module uses a configuration file, 'autodisc.conf', which defines patterns to look for in a web page's HTML code. At the moment the config file is pretty primitive and defines just three kinds of configuration directives:

  • REWRITE_URL defines a rule for rewriting the URL of a website that makes it hard to autodiscover the mobile link. For example, a page could use JavaScript to pop up the print version of the page. In such a case a REWRITE_URL rule can be used to match the host that uses this technique and rewrite part of the URL to another.
  • PRINT_LINK defines what a print link might look like. For example, it could say 'print this page' or 'print this article'. This directive lists such phrases to look for.
  • IGNORE_URL defines URLs to ignore. For example, a link to a Flash animation should definitely be ignored, as it does not have a mobile version at all. You can place the .swf extension in this ignore list to keep autodiscovery.py from downloading such files.
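
To make these directives more concrete, here is a rough sketch of how rules of this kind might be applied (the directive values and the code below are made up for illustration; the real autodiscovery.py and autodisc.conf may do this differently):

# A rough illustration of applying REWRITE_URL, PRINT_LINK and IGNORE_URL style rules.
# The directive values below are made up; they do not come from the real autodisc.conf.
import re
import urllib2
import urlparse

IGNORE_URL  = ['.swf', '.jpg', '.png']                    # never fetch these
PRINT_LINK  = ['print this page', 'print this article', 'printer friendly']
REWRITE_URL = [(r'^http://example\.com/story/', 'http://example.com/print/story/')]

def autodiscover_sketch(url):
    # IGNORE_URL: skip links that cannot have a mobile version (flash, images, ...)
    if any(url.lower().endswith(ext) for ext in IGNORE_URL):
        return None

    # REWRITE_URL: known hosts get their URLs rewritten directly, no download needed
    for pattern, replacement in REWRITE_URL:
        if re.match(pattern, url):
            return re.sub(pattern, replacement, url)

    html = urllib2.urlopen(url).read()

    # look for a <link media="handheld"> tag in the page head (attribute order simplified)
    m = re.search(r'<link[^>]+media=["\']handheld["\'][^>]+href=["\']([^"\']+)', html, re.I)
    if m:
        return urlparse.urljoin(url, m.group(1))

    # PRINT_LINK: look for an <a> whose text matches one of the known print phrases
    for phrase in PRINT_LINK:
        m = re.search(r'<a[^>]+href=["\']([^"\']+)["\'][^>]*>\s*' + re.escape(phrase), html, re.I)
        if m:
            return urlparse.urljoin(url, m.group(1))

    return None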

Configuration used by autodiscovery.py: autodiscovery configuration (redditriver.com project) (downloaded 3680 times)

Creating the web.py Application

The final part of the project was creating the web.py application.

It was pretty straightforward to create, as it only required writing the correct SQL queries to select the right data out of the database.

Here is what the URL controller for the web.py application looks like:

urls = (
    '/',                                 'RedditRiver',
    '/page/(\d+)/?',                     'RedditRiverPage',
    '/r/([a-zA-Z0-9_.-]+)/?',            'SubRedditRiver',
    '/r/([a-zA-Z0-9_.-]+)/page/(\d+)/?', 'SubRedditRiverPage',
    '/reddits/?',                        'SubReddits',
    '/stats/?',                          'Stats',
    '/stats/([a-zA-Z0-9_.-]+)/?',        'SubStats',
    '/about/?',                          'AboutRiver'
)

The first version of reddit river implements browsable front page stories (the RedditRiver and RedditRiverPage classes), browsable subreddit stories (the SubRedditRiver and SubRedditRiverPage classes), a list of the most popular subreddits (the SubReddits class), front page and subreddit statistics (most popular stories and most active users, the Stats and SubStats classes), and an about page (the AboutRiver class).
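
For a rough idea of what sits behind these routes, here is a simplified sketch of a front page handler (this is not the actual redditriver.com code; the database file name and the plain-text output are my simplifications, while the real handlers paginate the results and render Cheetah templates):

# A simplified sketch of a front page handler class, not the actual redditriver.com code.
import sqlite3

class RedditRiver:
    def GET(self):
        conn = sqlite3.connect('redditriver.db')    # database file name is an assumption
        # subreddit_id 0 is the fake 'front_page' subreddit inserted when the database was created
        rows = conn.execute(
            "SELECT title, url, url_mobile, score, comments, user FROM stories "
            "WHERE subreddit_id = 0 ORDER BY position LIMIT 25").fetchall()
        conn.close()
        for title, url, url_mobile, score, comments, user in rows:
            # prefer the discovered mobile version of a story, if there is one
            print '%s (%d points, %d comments): %s' % (title, score, comments, url_mobile or url)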

The source code: web.py application (redditriver.com project) (downloaded: 3720 times)

Release

I have put it online! Click redditriver.com to visit the site.

I have also released the source code. Here are all the files mentioned in the article, and a link to the whole website package.

Download Programs which Made Reddit River Possible

All the programs in a single .zip:
Download link: full redditriver.com source code
Downloaded: 6905 times

Individual scripts:

Download link: subreddit extractor (redditriver.com project)
Downloaded: 4827 times

Download link: reddit stories extractor (redditriver.com project)
Downloaded: 3093 times

Download link: subreddit table updater (redditriver.com project)
Downloaded: 2220 times

Download link: story table updater (redditriver.com project)
Downloaded: 2169 times

Download link: mobile webpage version autodiscovery (redditriver.com project)
Downloaded: 3720 times

Download link: autodiscovery configuration (redditriver.com project)
Downloaded: 3680 times

Download link: web.py application (redditriver.com project)
Downloaded: 2788 times

All these programs are released under the GNU GPL license, so you may derive your own stuff from them, but do not forget to share your derivative work with everyone!


Alexis recently sent me a reddit t-shirt for doing the redditmedia project, so I decided to take a few photos wearing it :)

peteris krumins loves reddit

Have fun, and I hope to hear a lot of positive feedback on the redditriver project! :)