bash readline emacs editing mode default keyboard shortcut cheat sheet

When you are working in a shell you certainly don't want to waste your time using arrow keys or home/end keys to navigate around the command line. One of the most popular shells, bash (the Bourne Again SHell), uses GNU's Readline library for reading the command line.

The GNU Readline library provides a set of functions for use by applications that allow users to edit command lines as they are typed in. The readline library also includes functions to maintain a list of previously-entered command lines, to recall and perhaps reedit those lines, and perform csh-like history expansion on previous commands. Both emacs and vi editing modes are available.

I have mastered both of the editing modes and have created cheat sheets for both of them (and a tiny separate one for readline's history expansion).

This is a cheat sheet for the default, emacs, editing mode.

Here are a few examples of how to use this editing mode.

Let '[]' be the position of cursor in all the examples.

Example 1: movement basics

Suppose you are at the end of the line and want to move 3 words backwards.

$ echo word1 word2 word3 word4 word5 word6[]

If you hit M-3 followed by M-b, you would end up exactly where you wanted:

$ echo word1 word2 word3 []word4 word5 word6

An alternative is to hit M-b three times in a row: M-b M-b M-b

If you look up M-3 on the cheat sheet, you will see that it sets the numeric-argument to 3, which in this case acts as a counter of how many times the M-b command should be repeated. The M-b command calls the backward-word function, which moves the cursor back one word.

The numeric-argument can also be negative, which makes the following command work in the opposite direction.

Other shortcuts of interest are M-f to move forward one word, and C-a and C-e to move to the beginning and end of the line.
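To make the effect of the numeric argument concrete, here is a small Python sketch that models backward-word on a plain string. It is only a model, not readline's own code: the function name is made up for illustration, and it assumes a word is a run of letters and digits.

```python
def backward_word(line: str, cursor: int, count: int = 1) -> int:
    """Return the cursor position after moving back `count` words.

    Simplified model: a word is a run of letters and digits.
    """
    for _ in range(count):
        # Skip any non-word characters directly before the cursor.
        while cursor > 0 and not line[cursor - 1].isalnum():
            cursor -= 1
        # Then skip back over the word itself.
        while cursor > 0 and line[cursor - 1].isalnum():
            cursor -= 1
    return cursor


line = "echo word1 word2 word3 word4 word5 word6"
pos = backward_word(line, len(line), count=3)  # models M-3 M-b
print(line[:pos] + "[]" + line[pos:])
```

Calling the function three times with count=1 gives the same position, which is the M-b M-b M-b alternative.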

Example 2: command history

Suppose you used a pretty complex command a while ago and now you remember just a few arguments of this command. You want to find this command and call it with a few arguments modified.

If you hit C-r, readline will put you in incremental reverse history search mode. Typing a part of the arguments you remember will locate the most recently executed command matching the typed text. Hitting C-r again will locate the next older command which matches your typed text.

To put the found command on the command line for editing, hit C-j.
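The search itself is easy to picture as a backwards scan over the history list. The following Python sketch shows the idea under simple assumptions: the history entries and the function name are hypothetical, and real readline also updates the match after every keystroke, which is left out here.

```python
def reverse_search(history, text, start=None):
    """Return the index of the newest entry at or before `start`
    that contains `text`, or None if nothing matches."""
    if start is None:
        start = len(history) - 1
    for i in range(start, -1, -1):
        if text in history[i]:
            return i
    return None


# Hypothetical history, oldest entry first.
history = [
    "ls -l",
    "tar czf backup.tar.gz --exclude=cache www",
    "cd /tmp",
    "tar xzf other.tar.gz",
]

first = reverse_search(history, "tar")              # first C-r match
second = reverse_search(history, "tar", first - 1)  # pressing C-r again
print(history[first])
print(history[second])
```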

Example 3: completing

Suppose you want to quickly list all the users on the system.

Hit C-x ~ and readline will attempt username completion and output all the usernames to the terminal.

$ []
adm        catonmat   ftp        halt       mailnull   nobody     root       smmsp      vcsa
apache     cpanel     games      lp         mysql      nscd       rpc        sshd
bin        daemon     gopher     mail       named      operator   rpm        sync
cat        dbus       haldaemon  mailman    news       picurls    shutdown   uucp
$ []

Suppose you now want to quickly list all the users on the system starting with 'm'. You can type 'm' followed by the same C-x ~ to do that.

$ m[]
mail      mailman   mailnull  mysql
$ m[]

The other interesting completions are:

  • C-x / which lists possible filename completions,
  • C-x $ which lists possible bash variable completions,
  • C-x @ which lists possible hostname completions, and
  • C-x ! which lists possible command completions.

The C-x shortcuts only list the candidates. To perform the completion itself, use the Meta variants:

  • Meta-/ which does filename completion,
  • Meta-$ which does bash variable completion,
  • Meta-@ which does hostname completion, and
  • Meta-! which does command completion.

Example 4: killing and yanking

Suppose you have to type a-long-word-like-this a couple of times.

The easiest way to do this is to kill the word, which puts it into the kill ring. Contents of the kill ring can be accessed by yanking.

For example, type 'a-long-word-like-this' in the shell:

$ command a-long-word-like-this []

Now press C-w to kill one word backward:

$ command []

Press C-y to yank (paste) the word as many times as you wish (I pressed it 3 times here):

$ command a-long-word-like-this a-long-word-like-this a-long-word-like-this []

The kill ring does not hold just the latest kill. It can hold a number of kills, and after a yank you can rotate through them with the M-y shortcut.
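A kill ring can be sketched in a few lines of Python. This is a simplified model with a made-up class name, not readline's implementation. For instance, it does not merge consecutive kills into one entry the way readline does.

```python
class KillRing:
    """Minimal model of a kill ring: kill, yank and yank-pop."""

    def __init__(self):
        self.entries = []  # killed texts, oldest first
        self.index = -1    # entry that the next yank returns

    def kill(self, text):
        # A new kill becomes the current entry.
        self.entries.append(text)
        self.index = len(self.entries) - 1

    def yank(self):
        # Models C-y: return the current entry.
        return self.entries[self.index]

    def yank_pop(self):
        # Models M-y: rotate to the previous (older) kill, wrapping around.
        self.index = (self.index - 1) % len(self.entries)
        return self.entries[self.index]


ring = KillRing()
ring.kill("first-kill")
ring.kill("a-long-word-like-this")
print(ring.yank())      # the latest kill
print(ring.yank_pop())  # the kill before it
print(ring.yank_pop())  # wraps around to the latest kill again
```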

Another example:

Suppose you typed a longer command and noticed that part of THE TEXT GOT TYPED IN CAPITAL LETTERS. Without knowing the readline shortcuts you would erase the text and probably type it again. With the readline keyboard shortcuts you can change the case very quickly.

You can use the following shortcuts to accomplish this:

1) The M-l shortcut (Meta-l; on your computer, probably ESC followed by l) is bound to readline's downcase-word function, which lowercases the current word.
2) The M-b shortcut is bound to readline's backward-word function, which moves the cursor one word backwards.
3) The M-<number> shortcut is bound to readline's numeric-argument function, which in many cases sets how many times the following command is repeated.

Here is a real-world example. Suppose we have typed the following ([] is the cursor):

$ echo the text. THE TEXT GOT TYPED IN CAPITAL LETTERS[]

To get to the beginning of 'THE' we might hit M-b seven times, or we could set the numeric argument to seven by typing M-7 and then hit M-b once.

After doing this the cursor would have moved before the word 'THE':

$ echo the text. []THE TEXT GOT TYPED IN CAPITAL LETTERS

Now, by setting the numeric argument to 7 again and pressing M-l, or by pressing M-l seven times, we turn all of the text to lower case.

$ echo the text. the text got typed in capital letters[]

Actually, what we did in this example was not as efficient as it could have been. The numeric-argument shortcut accepts negative arguments, which reverse the direction of the following command. We could have turned the text to lower case by hitting M--7 followed by M-l, without moving the cursor first.

If you really want to be more productive, I suggest you play around with the commands in the cheat sheet for a while.
My previous article on being more productive on the command line was the screen cheat sheet. Screen lets you run multiple terminals in a single window. You can take a look at it as well!

Download Emacs Editing Mode Cheat Sheet

PDF format (.pdf):
Download link: readline emacs cheat sheet (.pdf)
Downloaded: 68165 times

ASCII .txt format:
Download link: readline emacs cheat sheet (.txt)
Downloaded: 15216 times

LaTeX format (.tex):
Download link: readline emacs cheat sheet (.tex)
Downloaded: 7341 times

This cheat sheet is released under the GNU Free Documentation License.

The next cheat sheet will be readline's vi editing mode's default keyboard shortcut cheat sheet! :)

extract mp3 audio track from youtube video

A few days ago a reader of my blog, Ankush Agarwal, asked in the comments of the downloading youtube videos with gawk article:

I've seen tools available to download just the audio from a youtube video, in various formats; but as per your explanation it seems, that the audio is integrated with the video in the .swf file. How can we extract only the audio part and have it converted to a format like mp3?

As I have written a few articles before on how to download YouTube videos with Perl, gawk and VBScript, and how to convert the downloaded flash video files (flv) to divx or xvid, or any other format with ffmpeg, it was very easy to help this guy.

This is a guide that explains how to extract audio tracks from any videos, not just YouTube.

First, let's download the ffmpeg tool (that's for the Windows operating system. If you are using Linux, you can get ffmpeg as a package from your distribution) and open the ffmpeg documentation in another window.

Let's choose a sample video which we will extract the audio track from. I found a music video clip, "My Chemical Romance - Famous Last Words".

Now, let's download the music video. If you are on a Windows machine, you may use my VBScript program to download the video (download vbscript youtube video downloader, read how to use it here), or if you are on Linux, you may use the gawk program to download the video (download gawk youtube video downloader, read how to use it here).

After downloading the video, I ended up with a file named My_Chemical_Romance_-_Famous_Last_Words.flv.

Once you have downloaded the video, just for the sake of interest, let's find out the audio quality of this YouTube video.
The ffmpeg documentation does not mention a switch which would just output the audio parameters of the input file. After experimenting a little with the ffmpeg tool, I found that if you specify just the '-i' switch and the input video file, ffmpeg will output information about the input streams and quit.

Here is an example of how it looks:

c:\> ffmpeg.exe -i My_Chemical_Romance_-_Famous_Last_Words.flv

Seems that stream 1 comes from film source: 1000.00 (1000/1) -> 24.00 (24/1)
Input #0, flv, from 'My_Chemical_Romance_-_Famous_Last_Words.flv':
  Duration: 00:04:27.4, start: 0.000000, bitrate: 64 kb/s
  Stream #0.0: Audio: mp3, 22050 Hz, mono, 64 kb/s
  Stream #0.1: Video: flv, yuv420p, 320x240, 24.00 fps(r)
Must supply at least one output file

From this information (the 'Stream #0.0' line) we can read that the audio bitrate of this YouTube video is 64kbit/s, the sampling rate is 22050Hz, the encoding is mp3, and the audio is mono.
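If you want a script to read these values for you, the audio stream line can be picked apart with a regular expression. Here is a Python sketch that works on the output shown above. The function name is made up, and the pattern is only an assumption fitted to this particular output; other ffmpeg versions may print the line differently.

```python
import re

# Pattern fitted to the "Stream #0.0: Audio: ..." line shown in this article.
AUDIO_LINE = re.compile(
    r"Stream #[\d.]+: Audio: (?P<codec>\w+), "
    r"(?P<rate>\d+) Hz, (?P<channels>\w+), (?P<bitrate>\d+) kb/s"
)


def parse_audio_stream(output):
    """Return the audio parameters found in ffmpeg's stream listing, or None."""
    m = AUDIO_LINE.search(output)
    if m is None:
        return None
    return {
        "codec": m["codec"],
        "sample_rate_hz": int(m["rate"]),
        "channels": m["channels"],
        "bitrate_kbps": int(m["bitrate"]),
    }


sample = """Input #0, flv, from 'My_Chemical_Romance_-_Famous_Last_Words.flv':
  Duration: 00:04:27.4, start: 0.000000, bitrate: 64 kb/s
  Stream #0.0: Audio: mp3, 22050 Hz, mono, 64 kb/s
  Stream #0.1: Video: flv, yuv420p, 320x240, 24.00 fps(r)"""

info = parse_audio_stream(sample)
print(info)
```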

You will be surprised how easy it is to extract the audio part as it is in the video. By just typing:

c:\> ffmpeg.exe -i My_Chemical_Romance_-_Famous_Last_Words.flv famous_last_words.mp3

the ffmpeg tool will extract it to an mp3 audio file!

That's it! After running this command you should have 'famous_last_words.mp3' file in the same folder/directory where the downloaded video file was!

We can go a little further and look up various audio switches in the ffmpeg documentation. For example, if you had some fancy alarm clock which can play an mp3, you might not need the whole 64kbit/s of bitrate. You might want to convert the audio to a lower bitrate, say 32kbit/s.

Section 3.5 - Audio Options of the ffmpeg documentation says:

`-ab bitrate' - Set the audio bitrate in bit/s (default = 64k).

So, by specifying a command line switch '-ab 32k' the audio will be converted to a lower bitrate of 32kbit/s.

Here is the example of running this command:

c:\> ffmpeg.exe -i My_Chemical_Romance_-_Famous_Last_Words.flv -ab 32k famous_last_word.32kbit.mp3
Seems that stream 1 comes from film source: 1000.00 (1000/1) -> 24.00 (24/1)
Input #0, flv, from 'My_Chemical_Romance_-_Famous_Last_Words.flv':
  Duration: 00:04:27.4, start: 0.000000, bitrate: 64 kb/s
  Stream #0.0: Audio: mp3, 22050 Hz, mono, 64 kb/s
  Stream #0.1: Video: flv, yuv420p, 320x240, 24.00 fps(r)
Output #0, mp3, to 'famous_last_word.32kbit.mp3':
  Stream #0.0: Audio: mp3, 22050 Hz, mono, 32 kb/s
Stream mapping:
  Stream #0.0 -> #0.0
size=    1045kB time=267.6 bitrate=  32.0kbits/s
video:0kB audio:1045kB global headers:0kB muxing overhead 0.000000%

The 'Stream #0.0: Audio' line under 'Output #0' indicates that the output audio was indeed encoded at a bitrate of 32kbit/s.
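The reported file size can be checked against the bitrate with a bit of arithmetic. This Python sketch assumes that '32k' means 32000 bit/s and that the 'kB' in the log means 1024 bytes. Both are assumptions, but they agree with the numbers in the log above.

```python
bitrate_bits_per_s = 32_000  # from "-ab 32k"
duration_s = 267.6           # from "time=267.6" in the log

# bits per second * seconds = bits; divide by 8 for bytes.
size_bytes = bitrate_bits_per_s * duration_s / 8
size_kib = size_bytes / 1024

print(f"{size_kib:.0f} kB")  # compare with "size= 1045kB" in the log
```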

Some other things you can do are changing the codec of the audio (the -acodec option; list all codecs with the -formats option) or cutting out just the part of the audio you are interested in (the -t and -ss options).

This technique actually involved re-encoding the audio which was already in the movie file. If you read the audio option documentation closely, you will find that the -acodec option says:

`-acodec codec' - Force audio codec to codec. Use the copy special value to specify that the raw codec data must be copied as is.

If the input video file came from YouTube, or it already had an mp3 audio stream, then the following command line will extract the audio much, much faster, because nothing is re-encoded:

c:\> ffmpeg.exe -i My_Chemical_Romance_-_Famous_Last_Words.flv -acodec copy famous_last_words.mp3
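If you want to run these conversions from a script, the three variants from this article (plain extraction, a lower bitrate, and stream copy) can be put together as argument lists. The Python sketch below only builds the commands. The function name and the default 'ffmpeg' executable name are assumptions, and the switches are the ones quoted in this article, which newer ffmpeg versions may spell differently.

```python
def build_extract_command(src, dst, bitrate=None, copy=False, ffmpeg="ffmpeg"):
    """Build an ffmpeg argument list for extracting the audio track.

    copy=True    -> copy the audio stream as is (-acodec copy)
    bitrate="32k" -> re-encode at the given bitrate (-ab)
    """
    cmd = [ffmpeg, "-i", src]
    if copy:
        cmd += ["-acodec", "copy"]
    elif bitrate:
        cmd += ["-ab", bitrate]
    cmd.append(dst)
    return cmd


video = "My_Chemical_Romance_-_Famous_Last_Words.flv"
for cmd in (
    build_extract_command(video, "famous_last_words.mp3"),
    build_extract_command(video, "famous_last_word.32kbit.mp3", bitrate="32k"),
    build_extract_command(video, "famous_last_words.mp3", copy=True),
):
    print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) on a machine where ffmpeg is installed.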

Have fun ripping your favorite music off YouTube! :)

ps. Do you have something cool and useful you would like to accomplish but do not have the necessary computer skills? Let me know in the comments and I will see if I can write an article about it!

This article is part of the article series "Creating"

The making of picurls, the Popurls for Pictures: buzziest pics

This is part two of a two-part article on how picurls was created. In part one we made a universal plugin-based website scraper in Perl to get posts from social news and social bookmarking websites.

In this part I will describe the database design and how the user interface was done in PHP with the Smarty template and caching engine. Picurls launches:
picurls - picture buzz, buzziest pics on the net

Database design

I chose the SQLite database for this project because of its simplicity (the whole database is a single file), its excellent performance and its small memory footprint. The only thing I am concerned about is concurrency.

SQLite FAQ says:

SQLite allows multiple processes to have the database file open at once, and for multiple processes to read the database at once. When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business.

However, client/server database engines (such as PostgreSQL, MySQL, or Oracle) usually support a higher level of concurrency and allow multiple processes to be writing to the same database at the same time. This is possible in a client/server database because there is always a single well-controlled server process available to coordinate access. If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine.

If picurls gets really popular, people might start getting database errors when posting comments or updating their profiles. I certainly do not want to create such a negative user experience.

The database uses only the simplest SQL constructs and thus can easily be moved to a client/server database engine if something bad happens.

Here is how the database schema of the first version of picurls looks:

CREATE TABLE items (
  -- (id column not shown)
  title      STRING  NOT NULL,
  sane_title STRING  NOT NULL,
  url        STRING  NOT NULL,
  thumb      STRING  NOT NULL,
  site_id    INTEGER NOT NULL,
  date_added DATE    NOT NULL,
  visible    BOOL    NOT NULL DEFAULT 1
);

CREATE TABLE tmp_items (
  -- (id column not shown)
  title      STRING  NOT NULL,
  url        STRING  NOT NULL,
  date_added DATE    NOT NULL,
  site_id    INTEGER NOT NULL,
  -- (remaining columns not shown)
);

CREATE TABLE comments (
  -- (id column not shown)
  comment        STRING  NOT NULL,
  item_id        INTEGER NOT NULL,
  user_id        STRING  NOT NULL,
  anonymous_name STRING,
  ip_address     STRING  NOT NULL,
  date_added     DATE    NOT NULL
);

CREATE TABLE sites (
  -- (id, name, sane_name and url columns not shown)
  visible   BOOL    NOT NULL DEFAULT 1,
  priority  INTEGER NOT NULL
);

CREATE TABLE users (
  -- (id and username columns not shown)
  password    STRING NOT NULL,
  data        STRING,
  ip_address  STRING NOT NULL,
  date_regged DATE   NOT NULL,
  date_access DATE   NOT NULL,
  can_login   BOOL   NOT NULL DEFAULT 1
);

CREATE INDEX IDX_sites_sane_name       on sites(sane_name);
CREATE INDEX IDX_sites_priority        on sites(priority);
CREATE INDEX IDX_items_site_id         on items(site_id);
CREATE INDEX IDX_items_date_added      on items(date_added);
CREATE INDEX IDX_items_sane_title      on items(sane_title);
CREATE INDEX IDX_comments_item_id      on comments(item_id);
CREATE INDEX IDX_comments_user_id      on comments(user_id);
CREATE INDEX IDX_comments_date_added   on comments(date_added);
CREATE INDEX IDX_comments_item_user_ip on comments(item_id, user_id, ip_address);
CREATE INDEX IDX_users_username        on users(username);

INSERT INTO sites (name, sane_name, url, priority) VALUES('Digg',        'digg',        '',        1);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Reddit',      'reddit',      '',          2);
INSERT INTO sites (name, sane_name, url, priority) VALUES('', 'delicious',   '',         3);
INSERT INTO sites (name, sane_name, url, priority) VALUES('StumbleUpon', 'stumbleupon', '', 4);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Flickr',      'flickr',      '',      5);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Simpy',       'simpy',       '',       6);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Furl',        'furl',        '',        7);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Boing Boing', 'boingboing',  '',  8);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Wired',       'wired',       '',       9);

INSERT INTO users (id, username, password, ip_address, date_regged, date_access, can_login) VALUES (0, 'anonymous', 'x', '', '1970-01-01 00:00:00', '1970-01-01 00:00:00', 0);

As I mentioned in part one I want the whole project to be reusable in the future (I already have an idea what I will fork from this project).

You might have noticed that the database schema contains almost no fields specific to picurls (except the 'thumb' field in the items and tmp_items tables).

Here is a very brief description of the tables:

  • items - contains links to pictures to be displayed on the front page of picurls.
  • tmp_items - contains links to possible pictures which the scraper (see part one of this article) found on social bookmarking/social news sites.
  • comments - contains user comments.
  • sites - contains information about sites picurls is collecting pictures from.
  • users - contains registered user information.

If SQLite becomes unsuitable for picurls at some point, I can just dump the database and use almost the same database schema (maybe changing a few field types) in MySQL or PostgreSQL.
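To get a feel for how the tables work together, here is a small Python sketch using the standard sqlite3 module and an in-memory database. It uses a cut-down version of the sites and items tables. The id columns, the column types in sites and the sample rows are assumptions made for the sketch, not part of the picurls source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sites (
  id        INTEGER PRIMARY KEY,  -- assumed
  name      STRING  NOT NULL,     -- type assumed
  sane_name STRING  NOT NULL,     -- type assumed
  priority  INTEGER NOT NULL
);
CREATE TABLE items (
  id         INTEGER PRIMARY KEY, -- assumed
  title      STRING  NOT NULL,
  url        STRING  NOT NULL,
  site_id    INTEGER NOT NULL,
  date_added DATE    NOT NULL,
  visible    BOOL    NOT NULL DEFAULT 1
);
CREATE INDEX IDX_items_site_id ON items(site_id);
""")

conn.execute(
    "INSERT INTO sites (name, sane_name, priority) VALUES ('Digg', 'digg', 1)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO items (title, url, site_id, date_added) VALUES (?, ?, ?, ?)",
    [
        ("Older picture", "http://example.com/1", 1, "2007-09-23 06:47:38"),
        ("Newer picture", "http://example.com/2", 1, "2007-09-26 18:18:02"),
    ],
)

# Newest visible items of one site, the kind of query a site page needs.
rows = conn.execute(
    """SELECT items.title FROM items
       JOIN sites ON sites.id = items.site_id
       WHERE sites.sane_name = ? AND items.visible = 1
       ORDER BY items.date_added DESC""",
    ("digg",),
).fetchall()
titles = [row[0] for row in rows]
print(titles)
```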

User Interface Design

I chose PHP as the server-side language for the user interface of picurls. One of the reasons is that it is one of the most popular programming languages, and as I am releasing the full source code of picurls, I expect someone to help me with adding features or just spotting bugs :)

The logic behind handling requests of the user interface works similar to the framework. First we define the URL structure, and specify which scripts will handle which request URLs.

Here is an example of picurls' URL structure:

$pages = Array(
    '#^/(?:index\.(?:html|php))?$#'  => 'page-index.php',       # main page handler
    '#^/site/(\w+)(?:-(\d+))?.html#' => 'page-site.php',        # site handler (digg, reddit etc)
    '#^/item/([a-z0-9-]+).html#'     => 'page-item.php',        # single item handler
    '#^/login.html#'                 => 'page-login.php',       # login page handler
    '#^/register.html#'              => 'page-register.php',    # registration page handler
    '#^/logout.html#'                => 'page-logout.php',      # logout page handler
    '#^/my-comments(?:-(\d+))?.html#'=> 'page-my-comments.php', # my comment page handler
    '#^/my-profile.html#'            => 'page-my-profile.php'   # my profile page handler
);

For example, a request to '' would get handled by the 'page-site.php' script. The value '3' (the page number) would get saved, so that page-site.php knows which page was requested.
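The dispatching idea is not specific to PHP. Here is a Python sketch of the same technique, using a few of the patterns from the table above. The dots are escaped here so they match only a literal dot, and the request path '/site/digg-3.html' is a hypothetical example chosen to fit the site pattern.

```python
import re

# (pattern, handler) pairs, checked in order.
PAGES = [
    (r"^/(?:index\.(?:html|php))?$", "page-index.php"),
    (r"^/site/(\w+)(?:-(\d+))?\.html", "page-site.php"),
    (r"^/item/([a-z0-9-]+)\.html", "page-item.php"),
    (r"^/login\.html", "page-login.php"),
]


def dispatch(path):
    """Return (handler, captured values) for the first matching pattern."""
    for pattern, handler in PAGES:
        m = re.search(pattern, path)
        if m:
            return handler, m.groups()
    return None, ()


print(dispatch("/site/digg-3.html"))  # site name and page number captured
print(dispatch("/site/digg.html"))    # page number is optional
print(dispatch("/"))                  # main page
```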

Each request to the server gets handled by the default webserver index file - index.php. To have it this way I set up mod_rewrite to rewrite URLs to index.php:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

It is very important to have human readable URLs, such as '' and not something like ''. This way the website will rank better on search engines and people will find what they are looking for more easily.

I noticed that people coming from search engines to digpicz and redditmedia have most often searched for a set of keywords they remembered from the picture.

Here is the index.php script that handles the URL requests and dispatches them to appropriate scripts:
See url structure handler script (index.php) (downloaded: 7951 times)

And here is the page-site.php script which handles requests for pictures from particular site (such as digg or stumbleupon):
See site handler script (page-site.php) (downloaded: 5474 times)

The actual contents of the pages get served by the Smarty templating and caching framework. It is good practice to separate application logic and content from its presentation. Read the about Smarty page for an example of why it's good practice if you have not done it before.

As I went with a dynamic (PHP) solution for serving content, expect the website to become quite popular, and am on a server with limited resources, I needed to find a good and fast way to display the contents of the website. Smarty has exactly what I am looking for: support for caching.

The first version of picurls does not invalidate the cache dynamically (based on content changes). Instead it just caches pages for a constant time, flushes them when that time is up, and regenerates them on the next request.
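Time-based caching of this kind is simple enough to sketch in a few lines. The Python below illustrates the idea only and is not Smarty's implementation. The class name, the 300 second lifetime and the fake clock are all made up for the example.

```python
import time


class TimedPageCache:
    """Cache rendered pages for a fixed number of seconds."""

    def __init__(self, lifetime_s, clock=time.time):
        self.lifetime_s = lifetime_s
        self.clock = clock
        self.pages = {}  # key -> (expires_at, html)

    def get(self, key, render):
        now = self.clock()
        entry = self.pages.get(key)
        if entry and entry[0] > now:
            return entry[1]  # still fresh, serve the cached copy
        html = render()      # missing or expired, render again
        self.pages[key] = (now + self.lifetime_s, html)
        return html


calls = []


def render_front_page():
    calls.append(1)
    return "<html>front page</html>"


now = [0.0]  # fake clock so the example does not have to sleep
cache = TimedPageCache(lifetime_s=300, clock=lambda: now[0])
cache.get("index", render_front_page)  # rendered
cache.get("index", render_front_page)  # served from cache
now[0] = 301.0
cache.get("index", render_front_page)  # expired, rendered again
print(len(calls))
```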

That's about it. :)

Please ask in the comments if you want a more detailed explanation of some software components!

Download Picurls Website Source Code

All the scripts in a single .zip:
Download link: full source code
Downloaded: 1173 times

The source code is released under GNU General Public License.

If you use the source and create your own picurls-like website you must link back to my (this) blog and you must link back to picurls!

The .zip archive contains several subdirectories:

  • cache - directory where Smarty keeps cached pages.
  • db - directory where the SQLite database is kept. I included a sample database with 90 pictures - 10 pictures from each of the sites picurls takes content from (digg, reddit, delicious, flickr, stumbleupon, simpy, furl, wired and boing boing).
  • locks - directory where scripts hold their lockfiles to ensure that only a single copy of the scraper/thumbnail generator scripts is running at any given time.
  • scraper - website data-miner/scraper program (see part one of this article for more information).
  • scripts - scripts which call scraper, insert data in the database and generate thumbnails.
  • templates - HTML templates for Smarty.
  • templates_c - directory where Smarty keeps compiled templates.
  • www - main website directory, containing all the PHP scripts, CSS, images and thumbnails (around 90 thumbnails, one for each item in the sample database).

To get your own picurls website running, you will have to edit the config.php file in the www directory and specify the full path to the SQLite database (in the db directory). That's just the user interface, though. No new items will ever be retrieved from websites until you set the scraper up. The instructions would take another article. If you want to try, though, look at the shell script in the scripts directory. Running this script periodically will scrape websites for new posts which look like image posts and try to insert them in the database (you will have to change some constants in the script). After this script has run, a second script has to be run (it also needs constants changed).


To reach peak productivity on the command line you most often need multiple terminals open at a given time: one where you have your text editor session open, one where you read documentation, a third where you test out program snippets, etc. You might be pretty productive in a terminal emulator which can have multiple terminals open in a single window, but what if you are connected to a distant server with a telnet/ssh client such as PuTTY? Would you run multiple PuTTY sessions? What if your connection breaks? Would you reconnect and start all over again? Definitely not!

One of the solutions is to use a terminal multiplexer such as screen.

What is screen (from GNU screen manual)?

Screen is a full-screen window manager that multiplexes a physical terminal between several processes, typically interactive shells. There is a scrollback history buffer for each virtual terminal and a copy-and-paste mechanism that allows the user to move text regions between windows. When screen is called, it creates a single window with a shell in it (or the specified command) and then gets out of your way so that you can use the program as you normally would. Then, at any time, you can create new (full-screen) windows with other programs in them (including more shells), kill the current window, view a list of the active windows, turn output logging on and off, copy text between windows, view the scrollback history, switch between windows, etc. All windows run their programs completely independent of each other. Programs continue to run when their window is currently not visible and even when the whole screen session is detached from the users terminal. Each virtual terminal provides the functions of the DEC VT100 terminal and, in addition, several control functions from the ANSI X3.64 (ISO 6429) and ISO 2022 standards (e.g., insert/delete line and support for multiple character sets).

When I first found screen, I didn't know many of its features. I knew I could create new terminals by pressing the default CONTROL-a-c key sequence and switch between them using the CONTROL-a-a sequence, that it would "save" my windows if I got disconnected, and that I could get them back by typing "screen -r" after I connected back to the computer. Not much else. I clearly wasn't as productive as I could have been. I set out to explore the whole screen program.

To do that I used my cheat-sheet approach. As I have written before this approach is to have a printed cheat sheet in front of me as I learn new commands. This way each time I look for a command I can scan over the other commands and I can remember them better subconsciously.

This cheat sheet summarizes all the default keyboard mappings, with the screen command each mapping executes and a description of each mapping.

I made this cheat sheet in 2001 but lost the file, so I recreated it in LaTeX, made a PDF and also converted it to .txt format.

Actually, I hadn't used LaTeX much and had difficulty getting it right the first time. I searched my document collection and found The Not So Short Introduction to LaTeX book. Really nice reading. I recommend it!

Here is an example screenshot of how I use screen effectively:

screen - terminal emulator - split windows, named tabs, date, load

For more cool screen screenshots, search Google Images for 'gnu screen'.

Here is the cheat sheet itself:

Download Screen Cheat Sheet

PDF format (.pdf):
Download link: screen cheat sheet (.pdf)
Downloaded: 165039 times

ASCII .txt format:
Download link: screen cheat sheet (.txt)
Downloaded: 23399 times

LaTeX format (.tex):
Download link: screen cheat sheet (.tex)
Downloaded: 6840 times

Have fun becoming more efficient with screen!

This article is part of the article series "Creating"

The making of picurls, the Popurls for Pictures: buzziest pics

This is part one of a two-part series on how I am going to develop picurls, a popurls-like website for the buzziest pics on the net.

As you remember, a few weeks ago I created redditmedia and digpicz (one of them has since moved to a new address). These sites were fun to make. I love to check them daily and people love them as well!

While I was creating the second site, digpicz, it struck me - why not create a single site similar to popurls which aggregates posts to pictures from many sources?

Let's get hands on it!

picurls - picture buzz, buzziest pics on the net

Technical Design of Picurls

Technically the website is very easy to create. It will use a bunch of Perl programs which periodically check for new posts on various social news and bookmarking sites, insert these posts into the database and create a small thumbnail for each post.

This project can now reuse components from digpicz and redditmedia. Each of them had its own post/story extractor which scraped the content from its site.

If you looked closely at these scripts, you would notice that a lot of code is almost identical in both of them.

Each site also used the same set of scripts for finding the "best" picture on the website, retrieving it and creating a thumbnail out of this pic. These scripts will also get reused.

The first version of the picurls website is going to scrape content from 9 sources: digg, reddit, delicious, flickr, stumbleupon, simpy, furl, wired and boing boing.

I really don't want to duplicate the same code over and over again by creating an extractor program for each of the sites. I would rather write a more generic plugin-based data scraper which can easily be extended and reused (read more about it below). The digg and reddit extractors can then easily be turned into scraper plugins.

The data scraper program will output posts in a human readable format ready for input to a program which inserts the posts into a temporary picture database.

Once the posts are in the database, another program will run and make a few attempts to extract a thumbnail for each post. If it succeeds, it will move the post from the temporary picture table to the real picture table.

The user interface of the site will be done in PHP and will use data from the real picture table. To minimize load on the server, the website pages will be cached. Designing a primitive caching system in PHP is very simple, but there are already solid libraries which do it for us. One of them is Smarty. Smarty is a template engine for PHP. More specifically, it facilitates a manageable way to separate application logic and content from its presentation. Not only that, it can easily be instructed to cache the pages.

Digpicz and redditmedia were completely pre-generated by a Perl program and the web server had to serve just regular HTML files. The sites had less functionality but they handled traffic well. It will be almost the same with picurls: Smarty will serve the cached HTML contents, and the cache will be flushed only when a new picture or a new comment is added. This way I do not have to worry about the server getting overloaded.

Using PHP as the server-side language for the user interface will make the website much more lively than redditmedia or digpicz. I want the new site to have comments, most viewed pics, most commented pics, search functionality, voting and more features later.

The first version of picurls will have comments but the other features will come later. Also I want to write it in a manner that the programs can be reused in other projects.

Designing the Generic Plugin-Based Data Scraper

The basic idea of the data scraper is to crawl websites and to extract the posts in a human readable output format. I want it to be easily extensible via plugins and be highly reusable. Also I want the scraper to have basic filtering capabilities to select just the posts which I am interested in.

There are two parts to the scraper - the scraper library and the scraper program which uses the library and makes it easier to scrape many sites at once.

The scraper library consists of the base class 'sites::scraper' and plugins for various websites. For example, Digg's scraper plugin is 'sites::digg' (it inherits from sites::scraper).

The constructor of each plugin takes 4 optional arguments: pages, vars, patterns and pattern_file.

  • pages - integer, specifies how many pages to scrape in a single run,
  • vars - hashref, specifies parameters for the plugin,
  • patterns - hashref, specifies string regex patterns for filtering posts,
  • pattern_file - string, path to file containing patterns for filtering posts.
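The patterns argument deserves a short illustration. The Python sketch below shows one way such a filter could work, assuming that a post is kept when any listed pattern matches its field, ignoring case, which is how the Digg example in this article behaves. The function name and the sample posts are made up. The real library is written in Perl and may work differently.

```python
import re


def matches_patterns(post, patterns):
    """Keep a post if any pattern matches the field it is listed under."""
    if not patterns:
        return True  # no patterns means no filtering
    for field, regexes in patterns.items():
        value = post.get(field, "")
        if any(re.search(rx, value, re.IGNORECASE) for rx in regexes):
            return True
    return False


patterns = {"title": ["php"], "desc": ["php"]}

# Hypothetical posts.
posts = [
    {"title": "The PHP Toolbox", "desc": "20+ resources"},
    {"title": "Learning Perl", "desc": "nothing about that other language"},
    {"title": "Rails rewrite", "desc": "100,000 lines of messy PHP code"},
]

kept = [p["title"] for p in posts if matches_patterns(p, patterns)]
print(kept)
```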

Here is a Perl one-liner example of scraper library usage (without the scraper program). This example scrapes 2 pages of the most popular stories from Digg's programming section, keeping just the posts matching 'php' (case insensitive):

perl -Msites::digg -we '$digg = sites::digg->new(pages => 2, patterns => { title => [ q/php/ ], desc => [ q/php/ ] }, vars => { popular => 1, topic => q/programming/ }); $digg->scrape_verbose'

In this example we scrape two pages of popular digg posts (the ones that made it to the front page) in the programming category (topic) which have 'php' in either the title or the description of the post.

Here is the output of the plugin:

comments: 27
container_name: Technology
container_short_name: technology
description: With WordPress 2.3 launching this week, a bunch of themes and plugins needed updating. If you're not that familiar with PHP, this might present a slight problem. Not to worry, though - we've collected together 20+ tools for you to discover the secrets of PHP.
human_time: 2007-09-26 18:18:02
id: 3587383
score: 921
status: popular
title: The PHP Toolbox: 20+ PHP Resources
topic_name: Programming
topic_short_name: programming
unix_time: 1190819882
user: ace77
user_profileviews: 17019
user_registrered: 1162332420
site: digg

comments: 171
container_name: Technology
container_short_name: technology
description: "Back in January 2005, I announced on the O'Reilly blog that I was going to completely scrap over 100,000 lines of messy PHP code in my existing CD Baby ( website, and rewrite the entire thing in Rails, from scratch."Great article.
human_time: 2007-09-23 06:47:38
id: 3548227
score: 1653
status: popular
title: 7 Reasons I Switched Back to PHP After 2 Years on Rails
topic_name: Programming
topic_short_name: programming
unix_time: 1190519258
user: Steaminx
user_profileviews: 14083
user_registrered: 1104849214
site: digg

Each story is represented as a paragraph of key: value pairs. In this case the scraper found 2 posts matching PHP.

Any program taking this output as input is free to choose which parts of the information it wants to use.

It is guaranteed that each plugin produces output with at least 'title', 'url' and 'site' fields.

The date of the post, if available, is output in two fields: 'unix_time' and 'human_time'.

In Picurls' case, I am interested in the 'title', 'url', 'unix_time' and 'site' fields.
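
Because each post is a paragraph of 'key: value' lines separated by a blank line, a consumer needs only a few lines of code to pick out the fields it cares about. Here is a minimal sketch in Python (it is not part of the scraper package, which is written in Perl). The function name parse_posts is made up for illustration, and the sample is trimmed from the Digg output above, so it carries no 'url' field.

```python
def parse_posts(text):
    """Split scraper output into a list of dicts, one dict per post."""
    posts = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current post, if there is one.
            if current:
                posts.append(current)
                current = {}
            continue
        # Split on the first ': ' only, because values may contain colons.
        key, sep, value = line.partition(": ")
        if sep:
            current[key] = value
    if current:
        posts.append(current)
    return posts


sample = """\
score: 921
title: The PHP Toolbox: 20+ PHP Resources
unix_time: 1190819882
site: digg

score: 1653
title: 7 Reasons I Switched Back to PHP After 2 Years on Rails
unix_time: 1190519258
site: digg
"""

# Keep only the fields a consumer such as picurls would care about.
wanted = ("title", "url", "unix_time", "site")
for post in parse_posts(sample):
    print({key: post[key] for key in wanted if key in post})
```

Fields that a plugin did not emit are simply skipped, so the same loop works for any site.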

To create a plugin, one must override just three methods from the base class:

  • site_name - method should return a unique site id which will be output in each post as 'site' field,
  • get_page_url - given a page number, the method should construct a URL to the page containing posts,
  • get_posts - given the content of the page at the URL returned by the last get_page_url call, the method should return an array of hashrefs, each holding the key => value pairs that describe one post.
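
To make the division of labour concrete, here is a rough Python analogue of the base class and one plugin. This is only a sketch of the idea, not a translation of the Perl sources: the class names, the fetch method and the pipe-separated page format are all invented for the example, and pages are served from memory so that it runs offline.

```python
class Scraper:
    """Hypothetical analogue of the 'sites::scraper' base class."""

    def __init__(self, pages=1, variables=None):
        self.pages = pages
        self.variables = variables or {}

    # The three methods every plugin must override.
    def site_name(self):
        raise NotImplementedError

    def get_page_url(self, page):
        raise NotImplementedError

    def get_posts(self, content):
        raise NotImplementedError

    def fetch(self, url):
        """Download a page. Left abstract here; a real one would do HTTP."""
        raise NotImplementedError

    def scrape(self):
        """Walk the requested pages and collect posts tagged with the site id."""
        posts = []
        for page in range(1, self.pages + 1):
            content = self.fetch(self.get_page_url(page))
            for post in self.get_posts(content):
                post["site"] = self.site_name()
                posts.append(post)
        return posts


class ExampleSite(Scraper):
    """Made-up plugin for a made-up site with 'title|url' lines per page."""

    FAKE_PAGES = {
        "http://example.com/?page=1": "First post|http://example.com/a.jpg",
        "http://example.com/?page=2": "Second post|http://example.com/b.png",
    }

    def site_name(self):
        return "example"

    def get_page_url(self, page):
        return f"http://example.com/?page={page}"

    def get_posts(self, content):
        posts = []
        for line in content.splitlines():
            title, url = line.split("|", 1)
            posts.append({"title": title, "url": url})
        return posts

    def fetch(self, url):
        return self.FAKE_PAGES[url]  # stand-in for a real download


for post in ExampleSite(pages=2).scrape():
    print(post)
```

The base class owns the page loop and the 'site' field, so a plugin only has to know how its own site builds URLs and lays out posts.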

It's very difficult to document everything the library does. Even for a library this simple it would take a few pages of documentation. If you are interested in the details, please take a look at the sources.

Take a look at website scraper plugin to see how trivial it is to write a new plugin:
See generic plugin-based scraper ( scraper plugin) (downloaded: 5199 times)

Here is the base class of the scraper library:
See generic plugin-based scraper (base class) (downloaded: 4138 times)

The scraper program takes a bunch of command line arguments and calls each plugin in turn, generating a huge amount of output.

Running the program without arguments prints its basic usage:

Usage: ./ <site[:M][:{var1=val1; var2=val2 ...}]> ... [/path/to/pattern_file]
Crawls given sites extracting entries matching optional patterns in pattern_file
Optional argument M specifies how many pages to crawl, default 1
Arguments (variables) for plugins can be passed via an optional { }

The arguments in { } get parsed and are then passed to the constructor of the site's plugin. Several sites can also be scraped at once.

For example, running the program with the following arguments:

./ reddit:2:{subreddit=science} stumbleupon:{tag=photography} picurls.txt

would scrape two pages of reddit's science subreddit and one page of the StumbleUpon website tagged 'photography', using the filtering rules in the file 'picurls.txt'.
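
The site argument syntax from the usage message, site[:M][:{var1=val1; var2=val2 ...}], is easy to take apart with one regular expression. The following Python sketch shows one way to do it. It is an illustration of the format only, not the code the scraper program uses, and parse_site_spec is a made-up name.

```python
import re

SPEC_RE = re.compile(r"""
    ^(?P<site>[^:]+)            # plugin name, e.g. reddit
    (?: : (?P<pages>\d+) )?     # optional page count M
    (?: : \{ (?P<vars>.*) \} )? # optional block of var=val pairs
    $
""", re.VERBOSE)


def parse_site_spec(spec):
    """Return (site, pages, variables) for one command line argument."""
    match = SPEC_RE.match(spec)
    if not match:
        raise ValueError(f"bad site spec: {spec!r}")
    # The usage message says M defaults to 1.
    pages = int(match.group("pages")) if match.group("pages") else 1
    variables = {}
    for pair in (match.group("vars") or "").split(";"):
        if pair.strip():
            key, _, value = pair.partition("=")
            variables[key.strip()] = value.strip()
    return match.group("site"), pages, variables


for arg in ("reddit:2:{subreddit=science}", "stumbleupon:{tag=photography}"):
    print(parse_site_spec(arg))
```

The resulting page count and variables correspond to the 'pages' and 'vars' arguments of a plugin's constructor.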

This is how the output of this program looks:

desc: Morning Glory at rest before another eruption, Yellow Stone National Park.
human_time: 2007-02-14 04:34:41
unix_time: 1171420481
site: stumbleupon

desc: Time for yur Bath
human_time: 2007-10-10 04:34:41
unix_time: 1191980081
site: stumbleupon

human_time: 2007-10-13 15:34:42
id: 2zq0v
score: 4
title: Sharpest image of Pluto ever taken
unix_time: 1192278882
user: clawoo
site: reddit

Here is the program itself:
See generic plugin-based scraper (scraper program) (downloaded: 3698 times)

Here is what the filter file picurls.txt for Picurls looks like:

# match picture urls
url: \.jpg$
url: \.gif$
url: \.png$

# match common patterns describing posts having pictures in them

photos? of
pics? of
images? of
pictures? of
comics? of
charts? of
graphs? of
graphics? of
(this|these|those) photos?
(this|these|those) pics?
(this|these|those) images?
photosets? (on|of)

# match domains containing just pics
url: xkcd\.com
url: flickr\.com
url: photobucket\.com
url: imageshack\.us
url: bestpicever\.com

The format of the file is the following:
[url: |title: |desc: ]regex_pattern

url:, title: and desc: are optional. They specify if the entry on a website should be matched against its url, title or description.

If none of url:, title: or desc: is specified, the pattern is matched against both the title and the description.
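
Here is a small Python sketch of how a filter file in this format could be read and applied. It only illustrates the rules described above and is not the library's Perl implementation: the function names are invented, '#' lines are assumed to be comments as in picurls.txt, and matching is assumed to be case insensitive, as in the Digg one-liner earlier.

```python
import re

FIELDS = ("url", "title", "desc")


def parse_patterns(text):
    """Return a list of (fields, compiled_regex) rules from filter file text."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = ("title", "desc")  # default when no prefix is given
        for field in FIELDS:
            prefix = field + ":"
            if line.startswith(prefix):
                fields = (field,)
                line = line[len(prefix):].strip()
                break
        rules.append((fields, re.compile(line, re.IGNORECASE)))
    return rules


def matches(post, rules):
    """A post is kept if any rule matches any of the fields it applies to."""
    return any(
        regex.search(post.get(field, ""))
        for fields, regex in rules
        for field in fields
    )


rules = parse_patterns("""
# match picture urls
url: \\.jpg$
pics? of
""")
print(matches({"title": "Pic of my cat", "url": "http://example.com/"}, rules))
print(matches({"title": "Nothing here", "url": "http://example.com/"}, rules))
```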

Update: I added another format field, 'perl:', which allows writing a filter predicate as an anonymous Perl subroutine that gets called on each scraped item. A predicate is a function which returns true or false for a given input. In our case, if the predicate returns true, the item is accepted (and later printed as output); if it returns false, the next predicate (if any) is considered and the same rule applies.

Here is an example of a filter file which defines a single predicate that filters out most of the URLs pointing to the root location of a site (not likely to contain an interesting picture):

# Discard items which point to index pages
perl: sub {
    use URI;
    my $post = shift;

    my $uri = URI->new($post->{url});
    my $path = $uri->path;

    if (!length $path) { # empty path
        return 0;
    }
    elsif ($path =~ m!^/+$!) { # just a slash '/'
        return 0;
    }
    elsif ($path =~ m!^/(home|index)\.(php|html|htm|aspx?)$!i) { # some index files
        return 0;
    }

    return 1;
}
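
For readers who do not speak Perl, the same predicate can be sketched in Python using only the standard library. This is just a restatement of the checks above for illustration. The filter file itself accepts only Perl subroutines, and the name not_index_page is made up.

```python
import re
from urllib.parse import urlsplit

# Same index file names as in the Perl predicate, matched case insensitively.
INDEX_FILE = re.compile(r"/(home|index)\.(php|html|htm|aspx?)", re.IGNORECASE)


def not_index_page(post):
    """Return False for posts whose URL points at a site's root or index page."""
    path = urlsplit(post["url"]).path
    if not path:                    # empty path
        return False
    if re.fullmatch(r"/+", path):   # just a slash '/'
        return False
    if INDEX_FILE.fullmatch(path):  # some index files
        return False
    return True


for url in ("http://example.com/", "http://example.com/pics/cat.jpg"):
    print(url, not_index_page({"url": url}))
```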

And here is the whole scraper package with the scraper library, 9 plugins and the scraper program:

Download Generic Plugin-Based Website Scraper

All the scripts in a single .zip:
Download link: generic plugin-based scraper (whole package)
Downloaded: 716 times

The second part of the article will discuss the database design and the user interface of the site.

Update: Click here to view Part II of the article.

Until next time!