reddit river: what flows online!

On my way back home from university (30 minutes+) I just love to read news from my favorite social news site reddit.com. A few weeks ago I saw this 'Ask Reddit' post which asked if we could get a reddit version for mobile phones. Well, I thought, it's a cool project and I can do it quickly.

While scanning through the comments of the 'Ask Reddit' post, I noticed davidlvann's comment where he said that Digg.com already had an almost plain-text version of Digg, called DiggRiver.com.

It didn't take me long to do a

$ whois redditriver.com
No match for "REDDITRIVER.COM".

to find that the domain RedditRiver.com was not registered! What a great name for a project! I quickly mailed my friend Alexis [kn0thing] Ohanian at Reddit (check his alien blog) to ask permission to do a Reddit River project. Sure enough, he registered the domain for me and I was free to make it happen!

I'll describe how I made the site, and I will release full source code.

Update: The project is now live!

Update: Full source code is now available! It includes all the scripts mentioned here!

Download full redditriver.com source code (downloaded 7359 times)

My language of choice for this project is Python, the same language reddit.com is written in.

This is actually the first real project I am doing in Python (I'm a big Perl fan). I have a good overall understanding of Python but I have never done a project from the ground up! Before doing the project I watched a few Python video lectures and read a bunch of articles to get into the mindset of a Pythonista.

Designing Stages of RedditRiver.com

The main goal of the project was to create a very lightweight version of reddit that monitors story changes (as stories get voted up or down) on several pages across the most popular subreddits, and that finds mobile versions of the posted stories. By that I mean rewriting URLs: say, a link to The Washington Post gets rewritten to the print version of the same article, or a link to youtube.com gets rewritten to the mobile version of YouTube, m.youtube.com.

The project was done in several separate steps.

  • First, I set up the web server to handle Python applications,
  • Then I created a few Python modules to extract contents of Reddit website,
  • Next I created an SQLite database and wrote a few scripts to save the extracted data,
  • Then I wrote a Python module to discover mobile versions of given web pages,
  • Finally, I created the web.py application to handle requests to RedditRiver.com!

Setting up the Web Server

I am very lucky to have a full dedicated server sponsored by ZigZap - We Are Tech (I seriously recommend them if you are looking for a great hosting!). Being an experienced Linux user, I asked them for a pure Linux server with no software or control panels pre-installed and that's exactly what I got! Thanks, ZigZap! :)

I already run this blog and picurls.com on the server, and I had chosen the lighttpd web server and the PHP programming language for these two projects. To get RedditRiver running, I had to add Python support to the web server.

I decided to use the web.py web framework to serve the HTML content because of its simplicity and because the Reddit guys used it themselves after rewriting Reddit from Lisp to Python.

Following the install instructions, getting web.py running on the server was as simple as installing the web.py package!

It was just as easy to get the lighttpd web server to communicate with web.py and my application. This required installing the flup package, which allows lighttpd to interface with web.py.

Update: after setting it all up, and experimenting a bit with web.py (version 0.23) and Cheetah's templates, I found that for some mysterious reason web.py did not handle "#include" statements of the templates. The problem was with web.py's 'cheetah.py' file, line 23, where it compiled the regular expression for handling "#include" statements:

r_include = re_compile(r'(?!\\)#include \"(.*?)\"($|#)', re.M)

When I tested it out in the interpreter,

>>> r_include = re.compile(r'(?!\\)#include \"(.*?)\"($|#)', re.M)
>>> r_include.search('#include "foo"').groups()
('foo', '')
>>> r_include.search('foo\n#include "bar.html"\nbaz').groups()
('bar.html', '')

it found #include's across multiline text just fine, but it did not work with my template files. I tested it like 5 times and just couldn't figure out why it was not working.

As RedditRiver is the only web.py application running on my server, I easily patched that regex on line 23 to something trivial and it all started working! I dropped all the negative lookahead magic and checking for end of the line:

r_include = re_compile(r'#include "(.*?)"', re.M)

As I said, I am not sure why the original regex did not work in the web.py application, but did work in the interpreter. If anyone knows what happened, I will be glad to hear from you! :)
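A likely culprit, pointed out by Ralph Corderoy in the comments below, is that my template files had Windows-style line endings (a carriage return before each newline), while the strings I typed into the interpreter did not. The following is a minimal sketch of that explanation; the two sample lines are made up for the demonstration, only the file name is taken from his comment.

```python
import re

# The pattern quoted above from web.py 0.23, and my simplified replacement.
original = re.compile(r'(?!\\)#include \"(.*?)\"($|#)', re.M)
simplified = re.compile(r'#include "(.*?)"', re.M)

unix_line = '#include "common.header.tpl.html"\n'
dos_line = '#include "common.header.tpl.html"\r\n'  # CR LF line ending

# With re.M, '$' matches only at the end of the string or right before '\n'.
# A '\r' sitting between the closing quote and '\n' therefore blocks the match.
unix_match = original.search(unix_line)
dos_match = original.search(dos_line)
simple_match = simplified.search(dos_line)

print(unix_match.group(1) if unix_match else None)
print(dos_match.group(1) if dos_match else None)
print(simple_match.group(1) if simple_match else None)
```

The simplified regex does not look at what follows the closing quote, which would explain why it started working.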

Accessing Reddit Website via Python

I wrote several Python modules (which also work as executables) to access information on Reddit - stories across multiple pages of various subreddits (and front page) and user created subreddits.

As Reddit still does not provide an API to access the information on their site, I had to extract the relevant information from the HTML content of the pages.

The first module I wrote is called 'subreddits.py'. It accesses http://reddit.com/reddits and returns (or prints out, if used as an executable) the list of the most popular subreddits (a subreddit is a reddit for a specific topic, for example, programming or politics).

Get this program here: subreddit extractor (redditriver.com project) (downloaded: 5487 times).

This module provides three useful functions:

  • get_subreddits(pages=1, new=False), which gets 'pages' pages of subreddits and returns a list of dictionaries of them. If new is True, gets 'pages' pages of new subreddits (http://reddit.com/reddits/new),
  • print_subreddits_paragraph(), which prints subreddits information in human readable format, and
  • print_subreddits_json(), which prints it in JSON format. The output is in utf-8 encoding.

The way this module works can be seen from the Python interpreter right away:

>>> import subreddits
>>> srs = subreddits.get_subreddits(pages=2)
>>> len(srs)
50
>>> srs[:5]
[{'position': 1, 'description': '', 'name': 'reddit.com', 'subscribers': 11031, 'reddit_name': 'reddit.com'}, {'position': 2, 'description': '', 'name': 'politics', 'subscribers': 5667, 'reddit_name': 'politics'}, {'position': 3, 'description': '', 'name': 'programming', 'subscribers': 9386, 'reddit_name': 'programming'}, {'position': 4, 'description': 'Yeah reddit, you finally got it. Context appreciated.', 'name': 'Pictures and Images', 'subscribers': 4198, 'reddit_name': 'pics'}, {'position': 5, 'description': '', 'name': 'obama', 'subscribers': 651, 'reddit_name': 'obama'}]
>>>
>>> from pprint import pprint
>>> pprint(srs[3:5])
[{'description': 'Yeah reddit, you finally got it. Context appreciated.',
  'name': 'Pictures and Images',
  'position': 4,
  'reddit_name': 'pics',
  'subscribers': 4198},
 {'description': '',
  'name': 'obama',
  'position': 5,
  'reddit_name': 'obama',
  'subscribers': 651}]
>>>
>>> subreddits.print_subreddits_paragraph(srs[3:5])
position: 4
name: Pictures and Images
reddit_name: pics
description: Yeah reddit, you finally got it. Context appreciated.
subscribers: 4198

position: 5
name: obama
reddit_name: obama
description:
subscribers: 651
>>>
>>> subreddits.print_subreddits_json(srs[3:5])
[
    {
        "position": 4,
        "description": "Yeah reddit, you finally got it. Context appreciated.",
        "name": "Pictures and Images",
        "subscribers": 4198,
        "reddit_name": "pics"
    },
    {
        "position": 5,
        "description": "",
        "name": "obama",
        "subscribers": 651,
        "reddit_name": "obama"
    }
]

Or it can be called from the command line:

$ ./subreddits.py --help
usage: subreddits.py [options]

options:
  -h, --help  show this help message and exit
  -oOUTPUT    Output format: paragraph or json. Default: paragraph.
  -pPAGES     How many pages of subreddits to output. Default: 1.
  -n          Retrieve new subreddits. Default: nope.

This module reused the awesome BeautifulSoup HTML parser module, and simplejson JSON encoding module.
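To give a feel for what this kind of HTML extraction involves, here is a minimal sketch. It is not the code from subreddits.py: it swaps BeautifulSoup and simplejson for the standard library's html.parser and json so it runs without extra packages, and the markup in SAMPLE_HTML is entirely hypothetical (the real reddit pages look different). The class name SubredditParser is also made up for this example.

```python
import json
import re
from html.parser import HTMLParser

# HYPOTHETICAL markup, invented for this sketch; not reddit's real HTML.
SAMPLE_HTML = """
<div class="subreddit">
  <a class="title" href="/r/pics/">Pictures and Images</a>
  <span class="subscribers">4198 subscribers</span>
</div>
<div class="subreddit">
  <a class="title" href="/r/obama/">obama</a>
  <span class="subscribers">651 subscribers</span>
</div>
"""

class SubredditParser(HTMLParser):
    """Collects one dictionary per <div class="subreddit"> block."""

    def __init__(self):
        super().__init__()
        self.subreddits = []
        self._field = None  # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get('class')
        if tag == 'div' and css_class == 'subreddit':
            # Position is simply the order of appearance on the page.
            self.subreddits.append({'position': len(self.subreddits) + 1})
        elif tag == 'a' and css_class == 'title' and self.subreddits:
            # reddit_name is the path component right after /r/
            match = re.match(r'/r/([^/]+)', attrs.get('href') or '')
            self.subreddits[-1]['reddit_name'] = match.group(1) if match else ''
            self._field = 'name'
        elif tag == 'span' and css_class == 'subscribers' and self.subreddits:
            self._field = 'subscribers'

    def handle_data(self, data):
        if self._field == 'name':
            self.subreddits[-1]['name'] = data.strip()
        elif self._field == 'subscribers':
            digits = re.sub(r'\D', '', data)
            self.subreddits[-1]['subscribers'] = int(digits) if digits else 0
        self._field = None

parser = SubredditParser()
parser.feed(SAMPLE_HTML)
print(json.dumps(parser.subreddits, indent=4))
```

The real module has to cope with whatever markup reddit serves at the moment, so a parser like this breaks whenever the page layout changes.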

The second program I wrote is called 'redditstories.py'. It accesses the specified subreddit and gets the latest stories from it. It was written pretty much the same way I did it for the redditmedia project in Perl.

Get this program here: reddit stories extractor (redditriver.com project) (downloaded: 3580 times).

This module also provides three similar functions:

  • get_stories(subreddit='front_page', pages=1, new=False), which gets 'pages' pages of stories from subreddit and returns a list of dictionaries of them. If new is True, gets new stories only,
  • print_stories_paragraph(), which prints stories information in human readable format, and
  • print_stories_json(), which prints it in JSON format. The output is in utf-8 encoding.

It can also be used as a Python module or executable.

Here is an example of using it as a module:

>>> import redditstories
>>> s = redditstories.get_stories(subreddit='programming')
>>> len(s)
25
>>> s[2:4]
[{'title': 'when customers don\'t pay attention and reply to a "donotreply.com" email address, it goes to Chet Faliszek, a programmer in Seattle', 'url': 'http://consumerist.com/371600/the-man-who-owns-donotreplycom-knows-all-the-secrets-of-the-world', 'unix_time': 1206408743, 'comments': 54, 'subreddit': 'programming', 'score': 210, 'user': 'srmjjg', 'position': 3, 'human_time': 'Tue Mar 25 03:32:23 2008', 'id': '6d8xl'}, {'title': 'mysql --i-am-a-dummy', 'url': 'http://dev.mysql.com/doc/refman/4.1/en/mysql-tips.html#safe-updates', 'unix_time': 1206419543, 'comments': 59, 'subreddit': 'programming', 'score': 135, 'user': 'enobrev', 'position': 4, 'human_time': 'Tue Mar 25 06:32:23 2008', 'id': '6d9d3'}]
>>> from pprint import pprint
>>> pprint(s[2:4])
[{'comments': 54,
  'human_time': 'Tue Mar 25 03:32:23 2008',
  'id': '6d8xl',
  'position': 3,
  'score': 210,
  'subreddit': 'programming',
  'title': 'when customers don\'t pay attention and reply to a "donotreply.com" email address, it goes to Chet Faliszek, a programmer in Seattle',
  'unix_time': 1206408743,
  'url': 'http://consumerist.com/371600/the-man-who-owns-donotreplycom-knows-all-the-secrets-of-the-world',
  'user': 'srmjjg'},
 {'comments': 59,
  'human_time': 'Tue Mar 25 06:32:23 2008',
  'id': '6d9d3',
  'position': 4,
  'score': 135,
  'subreddit': 'programming',
  'title': 'mysql --i-am-a-dummy',
  'unix_time': 1206419543,
  'url': 'http://dev.mysql.com/doc/refman/4.1/en/mysql-tips.html#safe-updates',
  'user': 'enobrev'}]
>>> redditstories.print_stories_paragraph(s[:1])
position: 1
subreddit: programming
id: 6daps
title: Sign Up Forms Must Die
url: http://www.alistapart.com/articles/signupforms
score: 70
comments: 43
user: markokocic
unix_time: 1206451943
human_time: Tue Mar 25 15:32:23 2008

>>> redditstories.print_stories_json(s[:1])
[
    {
        "title": "Sign Up Forms Must Die",
        "url": "http:\/\/www.alistapart.com\/articles\/signupforms",
        "unix_time": 1206451943,
        "comments": 43,
        "subreddit": "programming",
        "score": 70,
        "user": "markokocic",
        "position": 1,
        "human_time": "Tue Mar 25 15:32:23 2008",
        "id": "6daps"
    }
]

Using it from a command line:

$ ./redditstories.py --help
usage: redditstories.py [options]

options:
  -h, --help   show this help message and exit
  -oOUTPUT     Output format: paragraph or json. Default: paragraph.
  -pPAGES      How many pages of stories to output. Default: 1.
  -sSUBREDDIT  Subreddit to retrieve stories from. Default:
               reddit.com.
  -n           Retrieve new stories. Default: nope.

These two programs just beg to be converted into a single Python module. They have the same logic with just a few changes in the parser. But for the moment I am generally happy, and they serve the job well. They can also be understood individually without having a need to inspect several source files.

I think that one of the future posts could be a reddit information accessing library in Python.

I can already think of a hundred things someone could do with such a library. For example, one could print out the top programming stories in his or her shell:

$ echo "Top five programming stories:" && echo && ./redditstories.py -s programming | grep 'title' | head -5 && echo && echo "Visit http://reddit.com/r/programming to view them!"

Top five programming stories:

title: Sign Up Forms Must Die
title: You can pry XP from my cold dead hands!
title: mysql --i-am-a-dummy
title: when customers don't pay attention and reply to a "donotreply.com" email address, it goes to Chet Faliszek, a programmer in Seattle
title: Another canvas 3D Renderer written in Javascript

Visit http://reddit.com/r/programming to view them!

Creating and Populating the SQLite Database

The database of choice for this project is SQLite. It is fast and light, and this project is so simple that I can't think of any reason to use a more complicated database system.

The database has a trivial structure with just two tables: 'subreddits' and 'stories'.

CREATE TABLE subreddits (
  id           INTEGER  PRIMARY KEY  AUTOINCREMENT,
  reddit_name  TEXT     NOT NULL     UNIQUE,
  name         TEXT     NOT NULL     UNIQUE,
  description  TEXT,
  subscribers  INTEGER  NOT NULL,
  position     INTEGER  NOT NULL,
  active       BOOL     NOT NULL     DEFAULT 1
);

INSERT INTO subreddits (id, reddit_name, name, description, subscribers, position) VALUES (0, 'front_page', 'reddit.com front page', 'since subreddit named reddit.com has different content than the reddit.com frontpage, we need this', 0, 0);

CREATE TABLE stories (
  id            INTEGER    PRIMARY KEY  AUTOINCREMENT,
  title         TEXT       NOT NULL,
  url           TEXT       NOT NULL,
  url_mobile    TEXT,
  reddit_id     TEXT       NOT NULL,
  subreddit_id  INTEGER    NOT NULL,
  score         INTEGER    NOT NULL,
  comments      INTEGER    NOT NULL,
  user          TEXT       NOT NULL,
  position      INTEGER    NOT NULL,
  date_reddit   UNIX_DATE  NOT NULL,
  date_added    UNIX_DATE  NOT NULL
);

CREATE UNIQUE INDEX idx_unique_stories ON stories (title, url, subreddit_id);

The 'subreddits' table contains information extracted by 'subreddits.py' module (described earlier). It keeps the information and positions of all the subreddits which appeared on the most popular subreddit page (http://reddit.com/reddits).

Reddit lists 'reddit.com' as a separate subreddit on the most popular subreddit page, but it turned out that it was not the same as the front page of reddit! That's why I insert a fake subreddit called 'front_page' in the table right after creating it, to keep track of both 'reddit.com' subreddit and reddit's front page.

The information in the table is updated by a new program - update_subreddits.py.

View: subreddit table updater (redditriver.com project) (downloaded: 2347 times)

The other table, 'stories' contains information extracted by 'redditstories.py' module (also described earlier).

The information in this table is updated by another new program - update_stories.py.

As it is impossible to keep track of all the scores and comments, and position changes across all the subreddits, the program monitors just a few pages on each of the most popular subreddits.

View: story table updater (redditriver.com project) (downloaded: 2299 times)
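The unique index on (title, url, subreddit_id) makes it easy to tell new stories from ones that are already stored. Here is a minimal sketch of one way an updater could lean on it: try to insert, and if the index rejects the row, update the score, comments and position instead. This is an illustration and not the code from update_stories.py; the function name save_story, the in-memory database and the subreddit id 3 are my assumptions for the example.

```python
import sqlite3
import time

conn = sqlite3.connect(':memory:')  # in-memory database, for the demo only
conn.execute("""
CREATE TABLE stories (
  id            INTEGER    PRIMARY KEY  AUTOINCREMENT,
  title         TEXT       NOT NULL,
  url           TEXT       NOT NULL,
  url_mobile    TEXT,
  reddit_id     TEXT       NOT NULL,
  subreddit_id  INTEGER    NOT NULL,
  score         INTEGER    NOT NULL,
  comments      INTEGER    NOT NULL,
  user          TEXT       NOT NULL,
  position      INTEGER    NOT NULL,
  date_reddit   UNIX_DATE  NOT NULL,
  date_added    UNIX_DATE  NOT NULL
)""")
conn.execute("CREATE UNIQUE INDEX idx_unique_stories ON stories (title, url, subreddit_id)")

def save_story(conn, story, subreddit_id):
    """Insert a new story, or refresh its counters if it is already stored."""
    try:
        conn.execute(
            "INSERT INTO stories (title, url, reddit_id, subreddit_id, score, comments,"
            " user, position, date_reddit, date_added)"
            " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
            (story['title'], story['url'], story['id'], subreddit_id,
             story['score'], story['comments'], story['user'], story['position'],
             story['unix_time'], int(time.time())))
    except sqlite3.IntegrityError:
        # The unique index rejected a duplicate, so the story is already known.
        conn.execute(
            "UPDATE stories SET score = ?, comments = ?, position = ?"
            " WHERE title = ? AND url = ? AND subreddit_id = ?",
            (story['score'], story['comments'], story['position'],
             story['title'], story['url'], subreddit_id))
    conn.commit()

# A story dictionary in the shape returned by redditstories.get_stories().
story = {'title': 'mysql --i-am-a-dummy',
         'url': 'http://dev.mysql.com/doc/refman/4.1/en/mysql-tips.html#safe-updates',
         'unix_time': 1206419543, 'comments': 59, 'score': 135,
         'user': 'enobrev', 'position': 4, 'id': '6d9d3'}

save_story(conn, story, subreddit_id=3)            # first run: inserted
story.update(score=150, comments=61, position=2)   # the story moved up
save_story(conn, story, subreddit_id=3)            # second run: updated in place

rows = conn.execute("SELECT score, comments, position FROM stories").fetchall()
print(rows)
```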

These two programs are run periodically by crontab (task scheduler in unix). The program update_subreddits.py gets run every 30 minutes and update_stories.py every 5 minutes.

Finding the Mobile Versions of Given Websites

This is probably the most interesting piece of software that I wrote for this project. The idea is to find versions of a website suitable for viewing on a mobile device.

For example, most of the stories on the politics subreddit link to the largest online newspapers and news agencies, such as The Washington Post or MSNBC. These websites provide a 'print' version of the page which is ideally suited for mobile devices.

Another example is websites that have designed a real mobile version of their page and let the user agent know about it by placing a <link rel="alternate" media="handheld" href="..."> tag in the head section of the HTML document.
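Looking for such a handheld link tag can be sketched in a few lines. This is only an illustration using the standard library's html.parser, not the code from my module; the names HandheldLinkFinder and find_handheld_link and the example.com page are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class HandheldLinkFinder(HTMLParser):
    """Remembers the href of the first <link rel="alternate" media="handheld">."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != 'link' or self.href is not None:
            return
        attrs = dict(attrs)
        rel = (attrs.get('rel') or '').lower().split()
        media = (attrs.get('media') or '').lower()
        if 'alternate' in rel and 'handheld' in media and attrs.get('href'):
            self.href = attrs['href']

def find_handheld_link(html, base_url):
    """Return the absolute URL of the handheld version, or None if there is none."""
    finder = HandheldLinkFinder()
    finder.feed(html)
    return urljoin(base_url, finder.href) if finder.href else None

page = ('<html><head><title>Example</title>'
        '<link rel="alternate" media="handheld" href="/mobile/story.html">'
        '</head><body></body></html>')
mobile_url = find_handheld_link(page, 'http://www.example.com/news/story.html')
no_mobile_url = find_handheld_link('<html><head></head></html>', 'http://www.example.com/')
print(mobile_url)
print(no_mobile_url)
```

The href is resolved against the page URL with urljoin because sites often give it as a relative path.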

I wrote an 'autodiscovery' Python module called 'autodiscovery.py'. This module is used by the update_stories.py program described in the previous section. After getting the list of new reddit stories, update_stories.py tries to autodiscover a mobile version of each story and, if it is successful, places it in the 'url_mobile' column of the 'stories' table.

Here is an example run of the module from the Python interpreter:

>>> from autodiscovery import AutoDiscovery
>>> ad = AutoDiscovery()
>>> ad.autodiscover('http://www.washingtonpost.com/wp-dyn/content/article/2008/03/24/AR2008032402969.html')
'http://www.washingtonpost.com/wp-dyn/content/article/2008/03/24/AR2008032402969_pf.html'
>>> ad.autodiscover('http://www.msnbc.msn.com/id/11880954/')
'http://www.msnbc.msn.com/id/11880954/print/1/displaymode/1098/'

And it can also be used from command line:

$ ./autodiscovery.py http://www.washingtonpost.com/wp-dyn/content/article/2008/03/24/AR2008032402969.html
http://www.washingtonpost.com/wp-dyn/content/article/2008/03/24/AR2008032402969_pf.html

Source: mobile webpage version autodiscovery (redditriver.com project) (downloaded 4132 times)

This module actually uses a configuration file 'autodisc.conf' which defines patterns to look for in the web page's HTML code. At the moment the config file is pretty primitive and defines just three configuration options:

  • REWRITE_URL defines a rule for rewriting the URL of a website which makes it difficult to autodiscover the mobile link. For example, a page could use JavaScript to pop up the print version of the page. In such a case a REWRITE_URL rule can be used to match the host which uses this technique and rewrite part of the URL.
  • PRINT_LINK defines what a print link might look like. For example, it could say 'print this page' or 'print this article'. This directive defines such phrases to look for.
  • IGNORE_URL defines URLs to ignore. For example, a link to a flash animation should definitely be ignored, as it does not have a mobile version at all. You can place the .swf extension in this ignore list to avoid it being downloaded by autodiscovery.py.

Configuration used by autodiscovery.py: autodiscovery configuration (redditriver.com project) (downloaded 4039 times)
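To show how the three directives could work together, here is a minimal sketch. Everything in it is hypothetical: the rule values, the Python data structures standing in for the config file (the real autodisc.conf syntax is not shown here), the function name discover_mobile_url, and the order in which the rules are applied. It only mirrors the ideas described above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# HYPOTHETICAL rules, written as Python data instead of the real config format.
REWRITE_URL = [
    # (host pattern, regex to find in the URL, replacement)
    (r'(^|\.)washingtonpost\.com$', r'\.html$', '_pf.html'),
]
PRINT_LINK = ['print this page', 'print this article']
IGNORE_URL = [r'\.swf$']

class PrintLinkFinder(HTMLParser):
    """Finds the first <a> whose text is one of the PRINT_LINK phrases."""

    def __init__(self):
        super().__init__()
        self.found = None
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # Normalize case and whitespace before comparing with the phrases.
            text = ' '.join(''.join(self._text).lower().split())
            if self.found is None and text in PRINT_LINK:
                self.found = self._href
            self._href = None

def discover_mobile_url(url, html=''):
    """Return a print/mobile URL for `url`, or None if nothing was found."""
    parts = urlsplit(url)
    if any(re.search(pattern, parts.path) for pattern in IGNORE_URL):
        return None  # e.g. flash animations have no mobile version
    for host_pattern, find, replacement in REWRITE_URL:
        if re.search(host_pattern, parts.hostname or ''):
            rewritten = re.sub(find, replacement, url)
            if rewritten != url:
                return rewritten
    finder = PrintLinkFinder()
    finder.feed(html)
    return urljoin(url, finder.found) if finder.found else None

wp = discover_mobile_url('http://www.washingtonpost.com/wp-dyn/content/article/2008/03/24/AR2008032402969.html')
swf = discover_mobile_url('http://www.example.com/intro.swf')
printed = discover_mobile_url('http://www.example.com/news/1.html',
                              '<a href="/news/1/print">Print this article</a>')
print(wp, swf, printed)
```

In the real module the page HTML is downloaded by autodiscovery.py itself; in this sketch it is passed in as a string so the example runs without network access.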

Creating the web.py Application

The final part to the project was creating the web.py application.

It was pretty straightforward to create, as it only required writing the correct SQL expressions for selecting the right data out of the database.

Here is what the controller for the web.py application looks like:
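To give an idea of what such a query might look like, here is a minimal sketch of a paged SELECT for one subreddit. The page size of 25, the helper name get_page, the trimmed-down table and the choice to fall back from url_mobile to url with COALESCE are my assumptions for illustration, not necessarily what the released application does.

```python
import sqlite3

STORIES_PER_PAGE = 25  # assumption: the real page size may differ

def get_page(conn, subreddit_id, page):
    """Return one page of stories for a subreddit, best position first."""
    offset = (page - 1) * STORIES_PER_PAGE
    return conn.execute(
        "SELECT title, COALESCE(url_mobile, url) AS link, score, comments"
        " FROM stories WHERE subreddit_id = ?"
        " ORDER BY position LIMIT ? OFFSET ?",
        (subreddit_id, STORIES_PER_PAGE, offset)).fetchall()

# Demo data: a cut-down 'stories' table with 30 made-up stories.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE stories (title TEXT, url TEXT, url_mobile TEXT,"
             " subreddit_id INTEGER, score INTEGER, comments INTEGER, position INTEGER)")
conn.executemany(
    "INSERT INTO stories VALUES (?, ?, ?, ?, ?, ?, ?)",
    [('Story %d' % i,
      'http://example.com/%d' % i,
      # only even-numbered stories get a mobile URL in this demo
      'http://m.example.com/%d' % i if i % 2 == 0 else None,
      1, 100 - i, i, i)
     for i in range(1, 31)])

second_page = get_page(conn, subreddit_id=1, page=2)
print(len(second_page), second_page[0])
```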

urls = (
    '/',                                 'RedditRiver',
    '/page/(\d+)/?',                     'RedditRiverPage',
    '/r/([a-zA-Z0-9_.-]+)/?',            'SubRedditRiver',
    '/r/([a-zA-Z0-9_.-]+)/page/(\d+)/?', 'SubRedditRiverPage',
    '/reddits/?',                        'SubReddits',
    '/stats/?',                          'Stats',
    '/stats/([a-zA-Z0-9_.-]+)/?',        'SubStats',
    '/about/?',                          'AboutRiver'
)

The first version of reddit river implements browsable front page stories (RedditRiver and RedditRiverPage classes), browsable subreddit stories (SubRedditRiver and SubRedditRiverPage classes), a list of the most popular subreddits (SubReddits class), front page and subreddit statistics (most popular stories and most active users; Stats and SubStats classes) and an about page (AboutRiver class).
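web.py pairs each URL pattern with the name of a handler class and passes the captured groups to the handler as arguments. The stand-alone sketch below shows how such a table resolves a request path. It uses plain `re` and a made-up function called resolve; it is not web.py's own dispatch code, and it returns the class name as a string instead of instantiating anything.

```python
import re

# The same table as above, written with raw strings.
urls = (
    r'/',                                 'RedditRiver',
    r'/page/(\d+)/?',                     'RedditRiverPage',
    r'/r/([a-zA-Z0-9_.-]+)/?',            'SubRedditRiver',
    r'/r/([a-zA-Z0-9_.-]+)/page/(\d+)/?', 'SubRedditRiverPage',
    r'/reddits/?',                        'SubReddits',
    r'/stats/?',                          'Stats',
    r'/stats/([a-zA-Z0-9_.-]+)/?',        'SubStats',
    r'/about/?',                          'AboutRiver',
)

def resolve(path):
    """Return (class name, captured groups) for the first pattern matching the whole path."""
    for pattern, name in zip(urls[::2], urls[1::2]):
        match = re.fullmatch(pattern, path)
        if match:
            return name, match.groups()
    return None, ()

for path in ('/', '/page/3', '/r/programming/', '/r/programming/page/2', '/stats/pics', '/nothing'):
    print(path, resolve(path))
```

Since the subreddit character class does not include a slash, a path like /r/programming/page/2 cannot be swallowed by the shorter SubRedditRiver pattern.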

The source code: web.py application (redditriver.com project) (downloaded: 4132 times)

Release

I have put it online! Click redditriver.com to visit the site.

I have also released the source code. Here are all the files mentioned in the article, and a link to the whole website package.

Download Programs which Made Reddit River Possible

All the programs in a single .zip:
Download link: full redditriver.com source code
Downloaded: 7359 times

Individual scripts:

Download link: subreddit extractor (redditriver.com project)
Downloaded: 5487 times

Download link: reddit stories extractor (redditriver.com project)
Downloaded: 3580 times

Download link: subreddit table updater (redditriver.com project)
Downloaded: 2347 times

Download link: story table updater (redditriver.com project)
Downloaded: 2299 times

Download link: mobile webpage version autodiscovery (redditriver.com project)
Downloaded: 4132 times

Download link: autodiscovery configuration (redditriver.com project)
Downloaded: 4039 times

Download link: web.py application (redditriver.com project)
Downloaded: 3003 times

All these programs are released under the GNU GPL license, so you may derive your own stuff, but do not forget to share your derivative work with everyone!

Alexis recently sent me a reddit t-shirt for doing the redditmedia project, so I decided to take a few photos wearing it :)

peteris krumins loves reddit

Have fun and I hope to hear a lot of positive feedback on redditriver project :)

Comments

March 26, 2008, 02:51

Great work, i especially love the fact that you try to automatically direct users to the print/mobile version of the remote site.

I had wanted to something like this myself, so thanks for killing a project of mine ^_^.

Also, while typing i just realized that your the same developer behind the Digg picture website, all credit to you, keep up the good (and very fast) work.

March 26, 2008, 03:06

Great work, will definitely be using this next time I am out and about.

Rodg
March 26, 2008, 03:30

Nice work. I have also been accessing a mobile version of reddit here:

http://m.phonefavs.com/reddit.com/.rss

The site takes reddit's rss feed through a mobile transcoder so all the links are made mobile.

idonthack
March 26, 2008, 04:11

looking through your autodiscover.py script, i did not see any attempts to search the page's header for a reference to a separate mobile stylesheet with link tags, or a "mobileoptimized" metatag as recognized by microsoft's mobile browser, both of which could be used to determine if the page actually needs autodiscovery

http://dev.mobi/node/403 - syntax for defining mobile and print stylesheets in html

http://msdn2.microsoft.com/en-us/library/bb431690.aspx - msdn page describing rendering modes of microsoft's mobile browser and the mobileoptimized metatag

idonthack
March 26, 2008, 04:13

oops. disregard that, i suck cocks.

i just didn't look very hard

ryan
March 26, 2008, 04:53

idonthack:

Firstly, let me say that I agree with your second comment!

Secondly, MSDN? Microsoft mobile browser? What ARE you talking about? Who uses that crap?

March 26, 2008, 05:27

Shouldn't the website use an infinite scroll if it's called *river.com...?

March 26, 2008, 05:34

Braydon, wow, what a fantastic idea! I'll work on it and see if I can get it done easily! Genius!

March 26, 2008, 10:01

Wonderful work Peteris, just wonderful. I like the way you stuffed things up there, self redirection and many other features.

Maybe do a skinning plugin/sub-application for iPhone users or something?

I really wish you a big gift from the reddit.com people; let's see if they will get you the first Lamborghini.

serkan
March 26, 2008, 10:26

I didn't read the whole article, sorry, but I use diggriver all the time and, well digg is going to crap these days so i've switched to reddit.

thank you so much

Cian
March 26, 2008, 20:44

Oops, maybe http://pypi.pyhthon.org/pypi/flup/1.0 should be http://pypi.python.org/pypi/flup/1.0

Really interesting article. Thanks
C

April 22, 2008, 17:45

Nice work! I know that Google has some service that converts regular pages to mobile versions of them, so you can use it when the page doesn't have a mobile version of his own.

Ralph Corderoy
July 13, 2008, 13:14

WRT the #include regexp, perhaps your template had whitespace after the closing double-quote, or an ASCII CR? We'd need a `grep '#include' template | od -c' to diagnose further.

Ah, OK, having got the ZIP'd source I see you've ASCII CRs.

$ g '#include' * | cat -A
about.tpl.html:#include "common.header.tpl.html"^M$

March 17, 2010, 13:37

Is the site discontinued or temporarily offline?

sandong
October 27, 2013, 15:38

Get free online games only here. The most complete collection of online games is all here.
