While using reddit, I noticed that many titles end with "(Pic)", "[Picture]", "(Video)", etc. This means that the content the link points to has a picture or video in it. Sometimes I want to have fun with my friends and go through all the pics or vids. Unfortunately reddit's search is broken and there is really no good way to see the best pics and videos voted up on reddit in the past.
I decided to create a reddit media site which will monitor reddit's front page, collect picture and video links, and build an archive of them over time.
On the about this blog page I wrote about one of the methods I like to use when developing software (and this project requires writing a few tools quickly, more about them below). It is called "the hacker's approach". The hacker's approach is basically writing software as fast as possible using everything available, without thinking much about the best development practices and without worrying what others will think about your code. If you are a good programmer, the code quality produced is just a bit worse than when writing it carefully, but the time saved is enormous.
I will release the full source code of the website with all the programs generating it. I will also blog about how the tools work and what ideas I used.
Update: Done! The site is up at reddit media: intelligent fun online.
Reddit Media Website's Technical Design Sketch
I use DreamHost shared hosting to run this website. Overall it is a great hosting company and I have been with them for more than a year now! Unfortunately, since it is shared hosting, the server sometimes gets overloaded and serving dynamic pages can become slow (a few seconds to load).
I want the new website to be as fast as possible even when the server is a bit loaded. I do not want any dynamic parsing to be involved when accessing the website. Because of this I will go with generating static HTML pages.
A Perl script will run every 30 minutes from crontab, fetch the reddit.com website and extract titles and URLs. Another script will add the titles to a lightweight SQLite on-disk database in case I ever want to make the website dynamic. A third script will use the entries in the database to generate the HTML pages.
A knowledgeable user might ask whether this design has a race condition at the moment a new static page is being generated while a user requests the same page. The answer is no. New pages will be written to temporary files and then moved in place of the existing ones. The website runs on the Linux operating system, and by looking up `man 2 rename' we find that
If newpath already exists it will be atomically replaced (subject to a
few conditions - see ERRORS below), so that there is no point at which
another process attempting to access newpath will find it missing.
The rename system call is atomic, which means we have no trouble with race conditions!
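The write-then-rename trick described above can be sketched in a few lines. This is only an illustration in Python (the real generator is a Perl script), and the function name write_page_atomically is made up for this example:

```python
import os
import tempfile


def write_page_atomically(path, html):
    """Write html to path so readers never see a missing or partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target, because rename(2) is
    # only atomic when both paths are on the same file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        # mkstemp creates the file readable by its owner only; open it up
        # so the web server can serve it (an assumption about the setup).
        os.chmod(tmp_path, 0o644)
        # os.replace calls rename(2) on Linux and overwrites the target.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A visitor requesting the page during regeneration gets either the complete old file or the complete new one.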
Reddit provides an RSS feed of the front page news. It has the 25 latest stories, and maybe 5 of them are media links. That is not enough links to launch the website. People visiting the site will get bored with just 5 links and a few new ones added daily. I need more content right at the moment I launch the site. Or I could launch the site later, when articles have piled up. Unfortunately, I do not want to wait and I want to launch it ASAP! The hacker's approach!
First, I will create a script which will go through all the pages on reddit looking for picture and video links, and insert the found items in the database. It will match patterns in link titles and will match domains which exclusively contain media.
Here is the list of patterns I could come up with which describe pictures and videos:
And here are the domains found on reddit which exclusively contain media:
This script will output the found items in human readable format, ready for input to another script which will absorb this information and put it in the SQLite database.
This script is called 'reddit_extractor.pl'. It takes one optional argument, which is the number of reddit pages to extract links from. If no argument is specified, it goes through all reddit pages until it hits the last one. For example, specifying 1 as the first argument makes it parse just the front page. I can now run this script periodically to find links on the front page. No need for parsing RSS.
There is one constant in this script which can be changed. This constant, VOTE_THRESHOLD, sets how many votes a post on reddit must have received to be collected by our program. I had to add it because when digging through older reddit posts, media with just 1 or 2 votes turns up, which means it really wasn't that good.
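Here is a rough sketch of the matching and threshold logic described above, written in Python purely for illustration (the real reddit_extractor.pl is Perl). The patterns, the domain table and the threshold value are my assumptions for this example, not the lists or the value the script actually uses:

```python
import re
from urllib.parse import urlparse

# Hypothetical values: the real pattern list, domain list and threshold
# used by reddit_extractor.pl are not reproduced in this post.
VOTE_THRESHOLD = 3
TITLE_PATTERNS = [
    (re.compile(r"[\[(]\s*(pics|pictures)\s*[\])]", re.I), "pictures"),
    (re.compile(r"[\[(]\s*(pic|picture)\s*[\])]", re.I), "picture"),
    (re.compile(r"[\[(]\s*(videos|vids)\s*[\])]", re.I), "videos"),
    (re.compile(r"[\[(]\s*(video|vid)\s*[\])]", re.I), "video"),
]
MEDIA_DOMAINS = {"youtube.com": "video", "video.google.com": "video"}


def classify(title, url, votes):
    """Return the media type for a post, or None if it should be skipped."""
    if votes < VOTE_THRESHOLD:
        return None
    # First try the markers people put in titles, e.g. "(Pic)" or "[Videos]".
    for pattern, media_type in TITLE_PATTERNS:
        if pattern.search(title):
            return media_type
    # Otherwise fall back to domains that host nothing but media.
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return MEDIA_DOMAINS.get(host)
```

For instance, classify("Funny cat (Pic)", "http://example.com/cat.jpg", 10) gives 'picture', while the same post with only 1 vote is skipped.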
The script outputs each media post matching a pattern or domain in the following format:
title (type, user, reddit id, url)
- title is the title of the article
- type is the media type. It can be one of 'video', 'videos', 'picture', 'pictures'. It is plural if the title uses a plural form such as "pics" or "videos".
- user is the reddit user who posted the link
- reddit id is the unique identifier reddit uses to identify its links
- url is the url to the media
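To make the line format above concrete, here is a small Python sketch that writes and reads it. It is not the code from the Perl scripts; the function names and the regular expression are assumptions, including the guess that user names and reddit ids contain no commas or spaces:

```python
import re

MEDIA_TYPES = ("video", "videos", "picture", "pictures")

# The title may itself contain parentheses and commas, so the pattern
# anchors on the last parenthesised group at the end of the line.
LINE_RE = re.compile(
    r"^(?P<title>.*) \((?P<type>videos?|pictures?), (?P<user>[^,\s]+), "
    r"(?P<reddit_id>[^,\s]+), (?P<url>\S+)\)$"
)


def format_post(title, media_type, user, reddit_id, url):
    """Build one 'title (type, user, reddit id, url)' line."""
    if media_type not in MEDIA_TYPES:
        raise ValueError("unknown media type: %r" % media_type)
    return "%s (%s, %s, %s, %s)" % (title, media_type, user, reddit_id, url)


def parse_post(line):
    """Split a line back into its fields, or return None if it is malformed."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None
```

Anchoring on the end of the line matters because a title like "Funny cat (Pic)" already contains a parenthesised part of its own.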
Script 'reddit_extractor.pl' can be viewed here:
reddit extractor (perl script, reddit media generator)
Then I will create a script which takes this input and puts it into the SQLite database. It is so trivial that there is not much to write about it.
The script will create an empty database on the first invocation, read the data from stdin and insert it into the database.
The database design is dead simple. It contains just two tables:
- reddit which stores the links found on reddit, and
- reddit_status which contains some info about how the page generator script used the reddit table
Going into more detail, the reddit table contains the following columns:
- id - the primary key of the table
- title - title of the media link found on reddit
- url - url to the media
- reddit_id - id reddit uses to identify its posts (used by my scripts to link to comments)
- user - username of the person who posted the link on reddit
- type - type of the media, can be: 'video', 'videos', 'picture', 'pictures'. It is plural if the title uses a plural form such as "pics" or "videos".
- date_added - the date the entry was added to the database
The other table, reddit_status, contains just two columns:
- last_id - the last id in the reddit table which the generator script used for generating the site
- last_run - the date of the last successful run of the generator script
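The two tables above could be created roughly like this. The table and column names come from the description; the column types, the UNIQUE constraint on reddit_id and the helper names are my assumptions, and the sketch uses Python's sqlite3 module instead of the Perl of db_inserter.pl:

```python
import sqlite3

# Column names follow the description; types and constraints are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reddit (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    title      TEXT NOT NULL,
    url        TEXT NOT NULL,
    reddit_id  TEXT NOT NULL UNIQUE,
    user       TEXT NOT NULL,
    type       TEXT NOT NULL
               CHECK (type IN ('video', 'videos', 'picture', 'pictures')),
    date_added TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS reddit_status (
    last_id  INTEGER,
    last_run TEXT
);
"""


def open_database(path):
    """Open the database, creating the tables on the first invocation."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def insert_post(conn, title, url, reddit_id, user, media_type):
    """Insert a post; return False if its reddit_id was already stored."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO reddit (title, url, reddit_id, user, type) "
            "VALUES (?, ?, ?, ?, ?)",
            (title, url, reddit_id, user, media_type),
        )
    return cur.rowcount == 1
```

Making reddit_id unique is one simple way to keep a post from being stored twice when the front page is scanned every 30 minutes.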
This script is called 'db_inserter.pl'. It does not take any arguments but has one constant which has to be changed before use. This constant, DATABASE_PATH, defines the path to the SQLite database. As I mentioned, the database does not have to exist; the script will create it on the first invocation.
These two scripts used together can now be run periodically from crontab to monitor reddit's front page and insert the links in the database. It can be done with a command as simple as:
reddit_extractor.pl 1 | db_inserter.pl
Script 'db_inserter.pl' can be viewed here:
db inserter (perl script, reddit media generator)
Now that we have our data, we just need to display it in a nice manner. That's the job of the generator script.
The generator script will run after the previous two scripts have run together, and it will use the information in the database to build static HTML pages.
Since generating static pages is computationally expensive, the generator has to be smart enough to minimize regeneration of already generated pages. The algorithm that minimizes regeneration is pretty simple and I commented it carefully; take a look at the 'generate_pages' function in the source.
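To give a feel for what minimal regeneration can look like, here is one possible scheme sketched in Python. It is not the algorithm from 'generate_pages'. It assumes items are paged oldest first with a fixed page size, so full pages never change, and it uses item counts where the real script tracks last_id. The page size and the function name are made up:

```python
ITEMS_PER_PAGE = 5  # hypothetical page size


def pages_to_regenerate(last_count, new_count, per_page=ITEMS_PER_PAGE):
    """Return the 1-based numbers of pages whose content changed.

    Items are numbered oldest first, so item n lives on page
    (n - 1) // per_page + 1 and already full pages never change.
    """
    if new_count <= last_count:
        return []
    # The first new item is number last_count + 1; find its page.
    first_changed = last_count // per_page + 1
    last_page = (new_count - 1) // per_page + 1
    return list(range(first_changed, last_page + 1))
```

With 12 items already published and 21 in the database, only pages 3, 4 and 5 need to be written. A real generator would also have to refresh navigation links on the page just before a newly created one.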
The script generates three kinds of pages at the moment - pages containing all pictures and videos, pages containing just pictures and pages containing just videos.
There is a lot of media featured on reddit, and as the script keeps things cached, the directory sizes can grow pretty quickly. If the file system performs badly with thousands of files in a single directory, the runtime of the script can degrade. To avoid this, the generator stores cached reddit posts in subdirectories based on the first character of their file name. For example, if the name of a cached file is 'foo.bar', it is stored as /f/foo.bar.
The other thing this script does is locate thumbnail images for media. For example, for YouTube videos it constructs the URL to their static thumbnails. For Google Video I could not find a public service for easily getting the thumbnail. The only way I found is to fetch the contents of the actual video page and extract the thumbnail from there. The same applies to many other video sites which do not tell developers how to get the thumbnail of a video. Because of this I had to write a Perl module, 'ThumbExtractor.pm', which, given a link to a video or picture, extracts the thumbnail.
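The first-character layout is easy to express in code. A minimal Python sketch (the function name cache_path and the cache root argument are assumptions, and the real generator does this in Perl):

```python
import os


def cache_path(cache_root, filename):
    """Return where a cached file lives, sharded by its first character."""
    if not filename or os.path.basename(filename) != filename:
        raise ValueError("expected a bare, non-empty file name")
    # 'foo.bar' ends up in the 'f' subdirectory of the cache root. The
    # caller is expected to create that subdirectory before writing.
    return os.path.join(cache_root, filename[0], filename)
```

So cache_path("cache", "foo.bar") points at the file foo.bar inside cache/f.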
'ThumbExtractor.pm' module can be viewed here:
thumbnail extractor (perl module, reddit media generator)
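As an illustration of the YouTube case, here is a Python sketch that builds a thumbnail URL from a watch link. It assumes the widely used img.youtube.com/vi/<id>/default.jpg address scheme, which may not be exactly what ThumbExtractor.pm constructs, and the function name is made up:

```python
from urllib.parse import urlparse, parse_qs


def youtube_thumbnail(url):
    """Return a static thumbnail URL for a YouTube watch link, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in ("youtube.com", "www.youtube.com"):
        return None
    # Watch links carry the video id in the 'v' query parameter.
    video_ids = parse_qs(parsed.query).get("v")
    if parsed.path != "/watch" or not video_ids:
        return None
    # Assumed thumbnail address scheme; no page download is needed.
    return "https://img.youtube.com/vi/%s/default.jpg" % video_ids[0]
```

Sites without such a predictable scheme are the ones that need the page fetched and the thumbnail extracted from its contents.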
Some of the links on reddit point directly to the actual image. I wouldn't want the reddit media site to take long to load, so I set out to find a way to cache small thumbnails on the server where the website is generated.
I had to write another module, 'ThumbMaker.pm', which downloads the image, makes a thumbnail of it and saves it to a known path accessible from the web server.
'ThumbMaker.pm' module can be viewed here:
thumbnail maker (perl module, reddit media generator)
To manipulate the images (create thumbnails), the ThumbMaker package uses Netpbm open source software.
Netpbm is a toolkit for manipulation of graphic images, including conversion of images between a variety of different formats. There are over 300 separate tools in the package including converters for about 100 graphics formats. Examples of the sort of image manipulation we're talking about are: Shrinking an image by 10%; Cutting the top half off of an image; Making a mirror image; Creating a sequence of images that fade from one image to another.
You will need this software (either compile it yourself or get the precompiled packages) if you want to run the reddit media website generator scripts!
To use the most common image operations easily, I wrote a package, 'Netpbm.pm', which provides operations like resize, cut, add border and others.
'Netpbm.pm' package can be viewed here:
netpbm image manipulation (perl module, reddit media generator)
I hit an interesting problem while developing the ThumbExtractor.pm and ThumbMaker.pm packages: what should they do if the link is to a regular website with just images? There is no simple way to tell which image the website wanted to show to users.
I thought for a moment and came up with an interesting but simple algorithm which finds "the best" image on the site.
It retrieves ALL the images from the site, finds the one with the biggest dimensions and makes a thumbnail out of it. The reasoning is simple: pictures posted on reddit are big and nice, so the biggest picture on the site is most likely the one that was meant to be shown.
A more advanced algorithm would analyze its location on the page and add weight to the score of how good the image is, depending on where it is located. The closer to the center of the screen, the higher the score.
For this reason I developed yet another Perl module called 'ImageFinder.pm'. See the 'find_best_image' subroutine to see how it works!
'ImageFinder.pm' module can be viewed here:
best image finder (perl module, reddit media generator)
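The "biggest image wins" idea fits in a few lines. This Python sketch is only an illustration and not the 'find_best_image' subroutine itself. It assumes the image dimensions have already been collected into (url, width, height) tuples, and the function name is made up:

```python
def pick_largest_image(images):
    """Return the URL of the image with the largest pixel area, or None.

    images is an iterable of (url, width, height) tuples.
    """
    best_url = None
    best_area = 0
    for url, width, height in images:
        area = width * height
        # Keep the first image seen with the largest area so far.
        if area > best_area:
            best_url, best_area = url, area
    return best_url
```

Given a 16x16 icon, a 468x60 banner and an 800x600 photo, it picks the photo.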
The generator script also uses CPAN's Template::Toolkit package for generating HTML pages from templates.
The name of the generator script is 'page_gen.pl'. It takes one optional argument, 'regenerate', which, if specified, clears the cache and regenerates all the pages anew. It is useful when templates are updated or changes are made to the thumbnail generator.
Program 'page_gen.pl' can be viewed here:
reddit media page generator (perl script)
While developing any piece of software I like solving various problems on paper. For example, with this site I had to solve the problems of how to regenerate existing pages minimally and how to resize thumbnails so they look nice.
Here is what the sheet on which I took small notes looked like after the site got published:
The final website is at the redditmedia.com address (now moved to http://reddit.picurls.com). Click http://reddit.picurls.com to visit it!
Here are all the scripts packed together with basic documentation:
Download Reddit's Media Site Generator Scripts
All the scripts in a single .zip:
Download link: reddit media website generator suite (.zip)
Download link: reddit extractor (perl script, reddit media generator)
Download link: db inserter (perl script, reddit media generator)
Download link: reddit media page generator (perl script)
Download link: thumbnail extractor (perl module, reddit media generator)
Download link: thumbnail maker (perl module, reddit media generator)
Download link: best image finder (perl module, reddit media generator)
Download link: netpbm image manipulation (perl module, reddit media generator)
For newcomers - What is reddit?
For newcomers, reddit is a social news website where users decide its contents.
From their FAQ:
What is reddit?
A source for what's new and popular on the web -- personalized for you. We want to democratize the traditional model by giving editorial control to the people who use the site, not those who run it. Your votes train a filter, so let reddit know what you liked and disliked, because you'll begin to be recommended links filtered to your tastes. All of the content on reddit is from users who are rewarded for good submissions (and punished for bad ones) by their peers; you decide what appears on your front page and which submissions rise to fame or fall into obscurity.
Have fun with the website and please tell me what you think about it in the comments! Thanks :)