How Reddit Top and Hacker Top Programs Were Made

Over the last two weeks I have released two top-like applications for following Reddit and Hacker News from the console. I called them Reddit Top and Hacker Top. I received a few emails and comments asking how the applications were made, so this article explains it.

A few months ago, while I was creating the Reddit River website, I noticed that Python's standard library included a curses module. Having worked with curses and the Curses Development Kit in C, I decided to refresh my curses skills, this time in Python.

Coding the application started with creating two separate Python modules for retrieving stories from Hacker News and Reddit.

If you look at the source code of the Hacker Top and Reddit Top programs, you'll notice two Python modules called "pyhackerstories.py" and "pyredditstories.py".

Both of these modules follow the same interface and provide a get_stories() function. The core of this function can be understood from this code fragment:

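stories = []
for i in range(pages):
    content = _get_page(url)              # fetch the current page
    entries = _extract_stories(content)   # turn it into a list of Story objects
    stories.extend(entries)
    url = _get_next_page(content)         # address of the next page, if any
    if not url:                           # no next page: stop early
        break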

The function iterates over the given number of Reddit or Hacker News pages and creates a list of objects (of type Story) containing information about stories on each page.

This function takes two optional parameters, 'pages' and 'new' (pyredditstories.py also takes an optional 'subreddit' parameter). 'pages' controls how many pages of stories to scrape, and 'new' selects the new stories rather than the most popular ones.

Here is an example of using the pyhackerstories.py module to get the titles and scores of the five most popular stories on Hacker News:
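To make the paging loop concrete, here is a minimal, self-contained sketch of it in Python 3. It is not the code from the modules: the three helper functions are stand-ins that read from a made-up in-memory "site" (FAKE_PAGES) instead of scraping anything, and the 'url' parameter is an assumption added only so the example can run on its own.

```python
# Hypothetical in-memory "site": each page has some stories and a link to the next page.
FAKE_PAGES = {
    "/page1": {"stories": ["A", "B"], "next": "/page2"},
    "/page2": {"stories": ["C", "D"], "next": "/page3"},
    "/page3": {"stories": ["E"], "next": None},
}


def _get_page(url):
    """Stand-in for the HTTP download; returns the page content."""
    return FAKE_PAGES[url]


def _extract_stories(content):
    """Stand-in for the parser; returns the stories found on one page."""
    return list(content["stories"])


def _get_next_page(content):
    """Stand-in for finding the 'next page' link; returns None on the last page."""
    return content["next"]


def get_stories(pages=1, url="/page1"):
    """Collect stories from at most `pages` pages, stopping early if they run out."""
    stories = []
    for _ in range(pages):
        content = _get_page(url)
        stories.extend(_extract_stories(content))
        url = _get_next_page(content)
        if not url:
            break
    return stories


print(get_stories(pages=2))
```

Asking for more pages than exist is harmless here, because the loop breaks as soon as there is no next page.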

>>> from pyhackerstories import get_stories
>>> stories = get_stories()
>>> for story in stories[:5]:
...     print "%-3d - %s" % (story.score, story.title)
...
30  - Rutgers Graduate Student Finds New Prime-Generating Formula
59  - Xobni VP Engineering leaves for own startup
69  - The Pooled-Risk Company Management Company
14  - Jeff Bonforte, CEO of Xobni, explains why Gabor left
52  - Google's Wikipedia clone Knol launches.

Each Story object contains the following properties:

position - story position
id - identifier used by Hacker News or Reddit to identify the story
title - title of the story
url - web address of the story
user - username of the user who submitted the story
score - number of upvotes the story has received
human_time - time the story was submitted
unix_time - unix time the story was submitted
comments - number of comments the story has received
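As a rough picture of such an object, here is a sketch of a Story type written as a Python 3 dataclass. The field names come from the list above; the field types and the example values are assumptions made for illustration, and the real modules may define Story differently.

```python
from dataclasses import dataclass


@dataclass
class Story:
    position: int    # story position
    id: str          # identifier used by the site (type is an assumption)
    title: str       # title of the story
    url: str         # web address of the story
    user: str        # username of the submitter
    score: int       # number of upvotes
    human_time: str  # submission time in human-readable form (format is an assumption)
    unix_time: int   # submission time as unix time
    comments: int    # number of comments


# Placeholder values, not real data from either site.
story = Story(
    position=1,
    id="example-id",
    title="Example title",
    url="http://example.com/",
    user="example_user",
    score=30,
    human_time="2 hours ago",
    unix_time=0,
    comments=0,
)

print("%-3d - %s" % (story.score, story.title))
```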

Here is another example, which finds the most active Reddit users on the programming subreddit (based on 5 pages of stories):

>>> from pyredditstories import get_stories
>>> stories = get_stories(subreddit='programming', pages=5)
>>> userdict = {}
>>> for story in stories:
...     userdict[story.user] = userdict.setdefault(story.user, 0) + 1
...
>>> users = [(userdict[u], u) for u in userdict]
>>> users.sort()
>>> users.reverse()
>>> for user in users[:5]:
...     print "%s: %d" % (user[1], user[0])
...
gst: 19
dons: 6
llimllib: 4
synthespian: 3
gthank: 3
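The same counting can be written more compactly with collections.Counter in Python 3. This is only an alternative sketch: the Story namedtuple and the usernames below are made-up stand-ins so the example runs without the real module.

```python
from collections import Counter, namedtuple

# Hypothetical minimal stand-in for the Story objects returned by get_stories().
Story = namedtuple("Story", "user title")

stories = [
    Story("alice", "t1"),
    Story("bob", "t2"),
    Story("alice", "t3"),
    Story("carol", "t4"),
    Story("alice", "t5"),
    Story("bob", "t6"),
]


def most_active(stories, n=5):
    """Return the n (user, submission count) pairs with the highest counts."""
    return Counter(story.user for story in stories).most_common(n)


for user, count in most_active(stories, n=2):
    print("%s: %d" % (user, count))
```

One difference from the sort-and-reverse version: when two users have the same count, most_common() keeps them in the order they were first seen, while sorting (count, user) tuples and reversing orders ties by username in reverse.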

The pyhackerstories.py and pyredditstories.py Python modules can also be used as standalone applications.

Executing pyredditstories.py with the '--help' command line argument prints the available options:

$ ./pyredditstories.py  --help
usage: pyredditstories.py [options]

options:
  -h, --help   show this help message and exit
  -sSUBREDDIT  Subreddit to retrieve stories from. Default:
               front_page.
  -pPAGES      How many pages of stories to output. Default: 1.
  -n           Retrieve new stories. Default: nope.
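For readers who want a similar command line in their own scripts, here is a sketch of how these three options could be declared with Python 3's argparse. It is an assumption-laden illustration, not the module's actual option handling, and its help output will be formatted differently from the text above.

```python
import argparse


def build_parser():
    """Declare -s, -p and -n with the defaults listed in the help text."""
    parser = argparse.ArgumentParser(prog="pyredditstories.py")
    parser.add_argument("-s", dest="subreddit", default="front_page",
                        help="Subreddit to retrieve stories from. Default: front_page.")
    parser.add_argument("-p", dest="pages", type=int, default=1,
                        help="How many pages of stories to output. Default: 1.")
    parser.add_argument("-n", dest="new", action="store_true",
                        help="Retrieve new stories.")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["-s", "programming", "-p", "2"])
print(args.subreddit, args.pages, args.new)
```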

Here is an example of executing pyredditstories.py to get the five most popular programming stories:

$ ./pyredditstories.py -s programming | grep '^title' | head -5

title: "Turns out my nephew is really good with computers, so we're going to give him the job!"
title: Sun Microsystems funding Haskell on multicore OpenSPARC!
title: I love parser combinators [Haskell]
title: 2D collision detection for SVG - demo of intersection routines with SVG/Javascript
title: Priorities: Solaris vs Linux

That's about enough information about these modules to create all kinds of wonderful things! Just play around a little!

Download pyhackerstories.py: catonmat.net/ftp/pyhackerstories.py
Download pyredditstories.py: catonmat.net/ftp/pyredditstories.py

Now I'll briefly explain how the curses user interface was made. I'll cover only the ideas and will not go deep into the code.

I started by reading the Curses Programming with Python howto. It explains all the basic curses operations in Python: how to get into curses mode, how to create windows, how to print colorful text and how to handle user input.

The user interface had two requirements. It had to stay responsive while new stories were being retrieved from the sites, and it had to be independent of the website, meaning that it should be easy to display data from a different website with minor (or no) modifications to the interface code.

The first requirement was satisfied by creating a separate thread (see Retriever class) which periodically called get_stories() and passed the stories to the interface via a queue.
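The idea can be sketched in a few lines of Python 3. This is a simplified illustration, not the Retriever class from the programs: the constructor arguments, the stop() method, the poll() helper and fake_get_stories() are all assumptions made up for the example.

```python
import queue
import threading


class Retriever(threading.Thread):
    """Background thread that periodically fetches stories and queues them."""

    def __init__(self, fetch, out_queue, interval):
        super().__init__(daemon=True)
        self.fetch = fetch            # callable such as get_stories
        self.out_queue = out_queue    # queue shared with the interface
        self.interval = interval      # seconds to wait between fetches
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            try:
                self.out_queue.put(self.fetch())
            except Exception as exc:
                # Hand errors to the interface instead of killing the thread.
                self.out_queue.put(exc)
            # Sleep for the interval, but wake up immediately on stop().
            self._stop_event.wait(self.interval)

    def stop(self):
        self._stop_event.set()


def poll(q):
    """Non-blocking read for the interface loop; returns None if nothing is ready."""
    try:
        return q.get_nowait()
    except queue.Empty:
        return None


def fake_get_stories():
    # Stand-in for a real get_stories() call, which would hit the network.
    return ["story one", "story two"]


stories_queue = queue.Queue()
retriever = Retriever(fake_get_stories, stories_queue, interval=0.05)
retriever.start()

# Block here only to keep the demo deterministic; a real interface would call
# poll() once per iteration of its input loop and keep handling key presses.
first = stories_queue.get(timeout=2)
retriever.stop()
retriever.join(timeout=2)
print(first)
```

Because the interface only ever does a non-blocking read from the queue, a slow or failing download never freezes the screen.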

The second requirement was already satisfied when I created the pyredditstories.py and pyhackerstories.py modules. As I mentioned, these modules provided the same interface for retrieving stories and could be used almost interchangeably.
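Here is a small sketch of what that interchangeability buys. The render() function below knows nothing about which site the stories came from; it only relies on each story having 'score' and 'title' attributes. The two fake_* functions and their stories are made up for illustration and stand in for the get_stories() functions of the two modules.

```python
from collections import namedtuple

# Hypothetical minimal stand-in for the Story objects.
Story = namedtuple("Story", "score title")


def fake_reddit_stories():
    return [Story(10, "A Reddit story")]


def fake_hacker_stories():
    return [Story(7, "A Hacker News story")]


def render(get_stories):
    """Format stories from any source that follows the get_stories() interface."""
    return ["%-3d - %s" % (s.score, s.title) for s in get_stories()]


for line in render(fake_reddit_stories) + render(fake_hacker_stories):
    print(line)
```

Switching the interface to another site then amounts to passing in a different function.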

That's basically it! If you have any specific questions, feel free to contact me.

Download Hacker Top and Reddit Top Programs

Hacker Top

Download link: hacker-top.tgz

Reddit Top

Download link: reddit-top.tgz

Next I'll release Dzone Top and Digg Top, and then merge all these top programs into a single social news console program. See you then!