This article is part of the article series "Creating picurls.com."
The making of picurls.com, the Popurls for Pictures: buzziest pics

This is part one of a two-part series on how I am going to develop picurls.com, a popurls.com-like website for the buzziest pics on the net.

As you remember, a few weeks ago I created redditmedia.com (now moved to http://reddit.picurls.com) and digpicz.com. These sites were fun to make. I love to check them daily and people love them as well!

While I was creating the second site, digpicz, it struck me - why not create a single site, similar to popurls, that aggregates picture posts from many sources?

Let's get our hands dirty!

Update! Picurls.com launches:
picurls - picture buzz, buzziest pics on the net

ps. I am terribly sorry for the delay between this post and the previous one. The last year at university has begun for me and I am pushing the number of subjects a person can take to the extreme. I'm taking 4 math courses in addition to the 5 required physics courses (9 courses total this term). I'm extremely busy and tired in the evenings after I get back home. Future posts might also not be as regular as before...

Technical Design of Picurls

Technically, the website is very easy to create. It will use a bunch of Perl programs which periodically check for new posts on various social news and bookmarking sites, insert these posts into the database, and create a small picture thumbnail for each post.

This project can now reuse components from digpicz and redditmedia. Each of those sites had its own post/story extractor which scraped the content from its source site.

If you looked closely at those scripts, you would have noticed that a lot of code was almost identical in both of them.

Each site also used the same set of scripts for finding the "best" picture on the website, retrieving it and creating a thumbnail out of this pic. These scripts will also get reused.

The first version of the Picurls website is going to scrape content from 9 sources.

I really don't want to duplicate the same code over and over again by creating a separate extractor program for each of the sites. I'd better write a more generic, plugin-based data scraper which can easily be extended and reused (read more about it below). The existing digg and reddit extractors can then easily be turned into scraper plugins.

The data scraper program will output posts in a human-readable format ready for input to a program which inserts the posts into a temporary picture database.

Once the posts are in the database, another program will run and make a few attempts to extract a thumbnail for each post. If it succeeds, it will move the post from the temporary picture table to the real picture table.
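
To make the hand-off concrete, here is a minimal sketch of the temporary-to-real table move, using an in-memory SQLite database. The table and column names (posts_tmp, posts, thumb) and the sample data are my assumptions for illustration; the actual picurls schema may differ.

```perl
#!/usr/bin/perl
# Sketch of promoting a post from the temporary table to the real one
# once its thumbnail has been created. Assumed schema, for illustration.
use warnings;
use strict;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
    { RaiseError => 1 });

$dbh->do(q{CREATE TABLE posts_tmp (id INTEGER PRIMARY KEY,
    title TEXT, url TEXT, site TEXT)});
$dbh->do(q{CREATE TABLE posts (id INTEGER PRIMARY KEY,
    title TEXT, url TEXT, site TEXT, thumb TEXT)});

# A freshly scraped post that the thumbnail extractor will work on.
$dbh->do(q{INSERT INTO posts_tmp (title, url, site) VALUES (?, ?, ?)},
    undef, 'Sharpest image of Pluto', 'http://example.com/pluto.jpg', 'reddit');

# After a thumbnail was successfully created, move the post from the
# temporary table to the real one in a single transaction.
sub promote_post {
    my ($dbh, $id, $thumb) = @_;
    $dbh->begin_work;
    $dbh->do(q{INSERT INTO posts (title, url, site, thumb)
               SELECT title, url, site, ? FROM posts_tmp WHERE id = ?},
             undef, $thumb, $id);
    $dbh->do(q{DELETE FROM posts_tmp WHERE id = ?}, undef, $id);
    $dbh->commit;
}

promote_post($dbh, 1, 'thumbs/1.jpg');
my ($n) = $dbh->selectrow_array(q{SELECT COUNT(*) FROM posts});
print "$n post(s) promoted\n";  # 1 post(s) promoted
```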

The user interface of the site will be done in PHP and will use data from the real picture table. To minimize load on the server, the website pages will be cached. Writing a primitive caching system in PHP is very simple, but there are solid libraries which already do it for us. One of them is Smarty, a template engine for PHP. More specifically, it facilitates a manageable way to separate application logic and content from their presentation. Not only that, it can easily be instructed to cache the pages.

Digpicz and redditmedia were completely pre-generated by a Perl program, and the web server had to serve just regular HTML files. The sites had less functionality but they handled traffic well. It will be almost the same with picurls: Smarty will serve the cached HTML contents, and only when a new picture or a new comment is added will the cache be flushed. This way I do not have to worry about the server getting too loaded.

Using PHP as the server-side language for the user interface will make the website much livelier than redditmedia or digpicz. I want the new site to have comments, most viewed pics, most commented pics, search functionality, voting, and more features later.

The first version of picurls will have comments; the other features will come later. I also want to write it in such a manner that the programs can be reused in other projects.

Designing the Generic Plugin-Based Data Scraper

The basic idea of the data scraper is to crawl websites and extract the posts in a human-readable output format. I want it to be easily extensible via plugins and highly reusable. I also want the scraper to have basic filtering capabilities to select just the posts I am interested in.

There are two parts to the scraper - the scraper library and the scraper program which uses the library and makes it easier to scrape many sites at once.

The scraper library consists of the base class 'sites::scraper' and plugins for various websites. For example, Digg's scraper plugin is 'sites::digg' (it inherits from sites::scraper).

The constructor of each plugin takes 4 optional arguments: pages, vars, patterns, and pattern_file.

  • pages - integer, specifies how many pages to scrape in a single run,
  • vars - hashref, specifies parameters for the plugin,
  • patterns - hashref, specifies string regex patterns for filtering posts,
  • pattern_file - string, path to file containing patterns for filtering posts.

Here is a Perl one-liner example of scraper library usage (without the scraper program). It scrapes the 2 most popular pages of stories from Digg's programming section, filtering just the posts matching 'php' (case insensitively):

perl -Msites::digg -we '$digg = sites::digg->new(pages => 2, patterns => { title => [ q/php/ ], desc => [ q/php/ ] }, vars => { popular => 1, topic => q/programming/ }); $digg->scrape_verbose'

In this example we scrape two pages of popular digg posts (those that made it to the front page) in the programming category (topic) which have 'php' in either the title or the description of the post.

Here is the output of the plugin:

comments: 27
container_name: Technology
container_short_name: technology
description: With WordPress 2.3 launching this week, a bunch of themes and plugins needed updating. If you're not that familiar with <strong>PHP</strong>, this might present a slight problem. Not to worry, though - we've collected together 20+ tools for you to discover the secrets of <strong>PHP</strong>.
human_time: 2007-09-26 18:18:02
id: 3587383
score: 921
status: popular
title: The <strong>PHP</strong> Toolbox: 20+ <strong>PHP</strong> Resources
topic_name: Programming
topic_short_name: programming
unix_time: 1190819882
url: http://mashable.com/2007/09/26/php-toolbox/
user: ace77
user_icon: http://digg.com/users/ace77/l.png
user_profileviews: 17019
user_registrered: 1162332420
site: digg

comments: 171
container_name: Technology
container_short_name: technology
description: "Back in January 2005, I announced on the O'Reilly blog that I was going to completely scrap over 100,000 lines of messy <strong>PHP</strong> code in my existing CD Baby (cdbaby.com) website, and rewrite the entire thing in Rails, from scratch."Great article.
human_time: 2007-09-23 06:47:38
id: 3548227
score: 1653
status: popular
title: 7 Reasons I Switched Back to <strong>PHP</strong> After 2 Years on Rails
topic_name: Programming
topic_short_name: programming
unix_time: 1190519258
url: http://www.oreillynet.com/ruby/blog/2007/09/7_reasons_i_switched_back_to_p_1.html
user: Steaminx
user_icon: http://digg.com/users/Steaminx/l.png
user_profileviews: 14083
user_registrered: 1104849214
site: digg

Each story is represented as a paragraph of key: value pairs. In this case the scraper found 2 posts matching PHP.

Any program taking this output as input is free to choose just the parts of the information it wants to use.
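
For example, a consumer could parse the paragraphs back into hashes with just a few lines of Perl. This is a minimal sketch of such a consumer, not part of the scraper package itself; the sample data is taken from the output shown above.

```perl
#!/usr/bin/perl
# Minimal sketch of a consumer for the scraper's output format:
# paragraphs of "key: value" lines separated by blank lines,
# one paragraph per post.
use warnings;
use strict;

sub parse_posts {
    my ($text) = @_;
    my @posts;
    for my $para (split /\n\s*\n/, $text) {
        my %post;
        for my $line (split /\n/, $para) {
            next unless $line =~ /^(\w+):\s*(.*)$/;
            $post{$1} = $2;
        }
        push @posts, \%post if %post;
    }
    return @posts;
}

# Two posts in the format shown above.
my $output = <<'END';
title: Sharpest image of Pluto ever taken
url: http://www.badastronomy.com/bablog/2007/10/12/sharpest-image-of-pluto-ever-taken/
site: reddit

title: Time for yur Bath
url: http://www.woostercollective.com/2007/07/16/giantduck1.jpg
site: stumbleupon
END

my @posts = parse_posts($output);
print scalar(@posts), " posts, first from ", $posts[0]{site}, "\n";  # 2 posts, first from reddit
```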

It is guaranteed that each plugin produces output with at least 'title', 'url' and 'site' fields.

The date of the post, if available, is given by two fields: 'unix_time' and 'human_time'.

In Picurls' case, I am interested in the 'title', 'url', 'unix_time' and 'site' fields.

To create a plugin, one must override just three methods from the base class:

  • site_name - method should return a unique site id which will be output in each post as 'site' field,
  • get_page_url - given a page number, the method should construct a URL to the page containing posts,
  • get_posts - given the content of the page located at the last get_page_url call, the subroutine should return an array of hashrefs containing key => val pairs with the post information.
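
Putting the three methods together, a minimal plugin skeleton could look like this. The 'example' site, its URL and its HTML markup are all made up for illustration; real plugins inherit from sites::scraper and contain more parsing logic.

```perl
package sites::example;
use warnings;
use strict;
# In the real library this would be: use base 'sites::scraper';
# (omitted here so the skeleton stands alone).

# site_name - unique site id, emitted as the 'site' field of each post.
sub site_name { 'example' }

# get_page_url - build the URL of the N-th page of posts.
sub get_page_url {
    my ($self, $page) = @_;
    return "http://example.com/popular?page=$page";
}

# get_posts - parse the page content into an array of post hashrefs.
# Each hashref must contain at least 'title' and 'url'.
sub get_posts {
    my ($self, $content) = @_;
    my @posts;
    while ($content =~ m{<a class="post" href="([^"]+)">([^<]+)</a>}g) {
        push @posts, { url => $1, title => $2 };
    }
    return @posts;
}

1;
```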

Documenting everything this library does would take a few pages, so if you are interested in the details, please take a look at the sources.

Take a look at digg.com website scraper plugin digg.pm to see how trivial it is to write a new plugin:
See generic plugin-based scraper (digg.com scraper plugin)

Here is the base class of scraper library scraper.pm:
See generic plugin-based scraper (base class)

The scraper program takes a bunch of command line arguments and calls each plugin in turn, generating a large amount of output.

The program is called scraper.pl. Running it without arguments prints its basic usage:

Usage: ./scraper.pl <site[:M][:{var1=val1; var2=val2 ...}]> ... [/path/to/pattern_file]
Crawls given sites extracting entries matching optional patterns in pattern_file
Optional argument M specifies how many pages to crawl, default 1
Arguments (variables) for plugins can be passed via an optional { }

The arguments in { } get parsed and then passed to the site plugin's constructor. Several sites can also be scraped at once.
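
Here is a sketch of how such a "site[:M][:{var1=val1; var2=val2}]" argument could be parsed. This is my guess at the parsing, based on the usage text; scraper.pl's actual code may differ.

```perl
#!/usr/bin/perl
# Sketch: parse one "site[:M][:{var1=val1; var2=val2}]" argument into
# the site name, the page count (default 1) and a hashref of vars.
use warnings;
use strict;

sub parse_site_arg {
    my ($arg) = @_;
    my %parsed = (pages => 1, vars => {});
    if ($arg =~ /^(\w+)(?::(\d+))?(?::\{(.*)\})?$/) {
        $parsed{site}  = $1;
        $parsed{pages} = $2 if defined $2;
        for my $pair (split /;\s*/, $3 // '') {
            my ($k, $v) = split /=/, $pair, 2;
            $parsed{vars}{$k} = $v if defined $v;
        }
    }
    return \%parsed;
}

my $r = parse_site_arg('reddit:2:{subreddit=science}');
print "$r->{site} $r->{pages} $r->{vars}{subreddit}\n";  # reddit 2 science
```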

For example, running the program with the following arguments:

./scraper.pl reddit:2:{subreddit=science} stumbleupon:{tag=photography} picurls.txt

This would scrape two pages of science.reddit.com and one page of StumbleUpon posts tagged 'photography', using the filtering rules in the file 'picurls.txt'.

This is how the output of this program looks:

desc: Morning Glory at rest before another eruption, Yellow Stone National Park.
human_time: 2007-02-14 04:34:41
title: public-domain-photos.com/free-stock-photos-4/travel/yellowstone/m...
unix_time: 1171420481
url: http://www.public-domain-photos.com/free-stock-photos-4/travel/yellowstone/morning-glory-pool.jpg
site: stumbleupon

desc: Time for yur Bath
human_time: 2007-10-10 04:34:41
title: woostercollective.com/2007/07/16/giantduck1.jpg
unix_time: 1191980081
url: http://www.woostercollective.com/2007/07/16/giantduck1.jpg
site: stumbleupon

human_time: 2007-10-13 15:34:42
id: 2zq0v
score: 4
title: Sharpest image of Pluto ever taken
unix_time: 1192278882
url: http://www.badastronomy.com/bablog/2007/10/12/sharpest-image-of-pluto-ever-taken/
user: clawoo
site: reddit

Here is the scraper.pl program itself:
See generic plugin-based scraper (scraper program)

Here is how the filter file picurls.txt for Picurls looks:

# match picture urls
#
url: \.jpg$
url: \.gif$
url: \.png$

# match common patterns describing posts having pictures in them
#
[[(].*picture.*[])]
[[(].*pic.*[])]
[[(].*image.*[])]
[[(].*photo.*[])]
[[(].*comic.*[])]
[[(].*chart.*[])]
[[(].*graph.*[])]

photos? of
pics? of
images? of
pictures? of
comics? of
charts? of
graphs? of
graphics? of
(this|these|those) photos?
(this|these|those) pics?
(this|these|those) images?
photosets? (on|of)

# match domains containing just pics
url: xkcd\.com
url: flickr\.com
url: photobucket\.com
url: imageshack\.us
url: bestpicever\.com

The format of the file is the following:
[url: |title: |desc: ]regex_pattern

The url:, title: and desc: prefixes are optional. They specify whether the entry from a website should be matched against its url, its title, or its description.

If none of url:, title: or desc: is specified, the pattern is matched against both the title and the description.
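
A minimal sketch of how one such pattern line could be applied to a post follows. This is my reading of the format, not the library's actual code; the case-insensitive matching mirrors the digg one-liner example earlier.

```perl
#!/usr/bin/perl
# Sketch: match one filter-file pattern line against a post hashref.
# A "url:"/"title:"/"desc:" prefix restricts the field; otherwise the
# pattern is tried against both title and description.
use warnings;
use strict;

sub post_matches {
    my ($post, $line) = @_;
    my ($field, $pattern) = $line =~ /^(?:(url|title|desc):\s*)?(.*)$/;
    my @fields = defined $field ? ($field) : ('title', 'desc');
    for my $f (@fields) {
        return 1 if defined $post->{$f} && $post->{$f} =~ /$pattern/i;
    }
    return 0;
}

my $post = { title => 'Morning Glory pool', desc => 'Yellowstone',
             url => 'http://example.com/morning-glory-pool.jpg' };
print post_matches($post, 'url: \.jpg$') ? "match\n" : "no match\n";  # match
```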

Update: I added another format field, 'perl:', which allows writing a filter predicate as an anonymous Perl subroutine that gets called on each scraped item. A predicate is a function which returns true or false for a given input. In our case, if the predicate returns true, the item gets accepted (and printed as output later); if false, the next predicate (if any) is considered and the same rule applies.

Here is an example of a filter file which defines a single predicate that filters out most URLs pointing to the root location of a site (not likely to contain an interesting picture):

# Discard items which point to index pages
#
perl: sub {
    use URI;
    my $post = shift;

    my $uri = URI->new($post->{url});
    my $path = $uri->path;

    if (!length $path) { # empty path
        return 0;
    }
    elsif ($path =~ m!^/+$!) { # just a slash '/'
        return 0;
    }
    elsif ($path =~ m!^/(home|index)\.(php|html|htm|aspx?)$!i) { # some index files
        return 0;
    }

    return 1;
}
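
One way such 'perl:' predicates can be compiled and applied is to eval the subroutine text into a code ref and call it on each post hashref. This is a sketch of the idea, not scraper.pl's actual implementation, and it uses a simplified regex check in place of the URI-based one above.

```perl
#!/usr/bin/perl
# Sketch: compile a 'perl:' predicate from its source text with eval,
# then use it to filter a list of post hashrefs.
use warnings;
use strict;

my $predicate_text = q{sub {
    my $post = shift;
    # Reject bare site roots like http://example.com/ (simplified check).
    return 0 if $post->{url} =~ m!^https?://[^/]+/*$!;
    return 1;
}};

my $predicate = eval $predicate_text;
die "bad predicate: $@" unless ref $predicate eq 'CODE';

my @posts = (
    { url => 'http://example.com/' },
    { url => 'http://example.com/pics/duck.jpg' },
);
my @accepted = grep { $predicate->($_) } @posts;
print scalar(@accepted), " accepted\n";  # 1 accepted
```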

And here is the whole scraper package with the scraper library, 9 plugins and the scraper program:

Download Generic Plugin-Based Website Scraper

All the scripts in a single .zip:
Download link: generic plugin-based scraper (whole package)

The second part of the article will discuss the database design and the user interface of picurls.com.

Update: Click here to view Part II of the article.

Hosting

ZigZap Technologies has kindly provided me with server hosting for all my needs. If you are looking for hosting solutions, they are excellent!

Until next time!


Comments

TWT Permalink
September 24, 2007, 01:03

Very cool article. Man you have class. I'm looking forward to Part 2.

September 25, 2007, 22:10

TWT, thanks for kind words :)
The next part will be hopefully out this week. (not sure 100%, though)

December 30, 2007, 03:32

You might also want to check out http://www.GetMyScoop.com. It is a newly created web-based customizable feed aggregator that also allows users to post their own content. I am the owner and developer of the website so if you are wondering about my objectivity of this post, I encourage you to visit the site yourself and check out its features. I built it as a proof of concept to learn first hand what you can do nowadays with very limited resources (only myself working evenings on and off for about 6 months). If you have any feedback, please use the "contact us" page on the website to get in touch with me. Popurls was one of the inspirations. My aim was to create a website that is completely free to use and empower users to the fullest extent possible in giving them the freedom they deserve with regard to both consuming and creating Internet content. I hope this goal is achieved as the website gets more and more popular and as I add more features in the future. Your feedback is greatly appreciated.

January 03, 2008, 23:45

Excellent post , thanks

August 02, 2008, 19:24

Thanks for the article... interesting. [translated from Italian]

haveacygar Permalink
November 07, 2008, 03:39

Has anyone managed to get scraper.pl working in Cygwin? I have all the required modules installed but scraper still cannot find them.

November 07, 2008, 05:56

Where did picurls.com go?

balticman Permalink
August 30, 2009, 18:44

please, renew some links in this articles :) for examples: picurls.com :)


Satishguru Permalink
September 03, 2010, 07:56

I personally like http://www.infonary.com. They have a good interface and apt information. You can browse through topics like business, religion etc. This site also provides a search functionality :-)
