To work at peak productivity on the command line, you usually need multiple terminals open at once: one with your text editor session, one for reading documentation, a third for testing out program snippets, and so on. A terminal emulator that can hold multiple terminals in a single window keeps you productive locally, but what if you are connected to a remote server with a telnet/ssh client such as PuTTY? Would you run multiple PuTTY sessions to get the same effect? And what if your connection drops? Would you reconnect and start all over again? Definitely not!

One of the solutions is to use a terminal multiplexer such as GNU screen.

What is screen (from GNU screen manual)?

Screen is a full-screen window manager that multiplexes a physical terminal between several processes, typically interactive shells. There is a scrollback history buffer for each virtual terminal and a copy-and-paste mechanism that allows the user to move text regions between windows. When screen is called, it creates a single window with a shell in it (or the specified command) and then gets out of your way so that you can use the program as you normally would. Then, at any time, you can create new (full-screen) windows with other programs in them (including more shells), kill the current window, view a list of the active windows, turn output logging on and off, copy text between windows, view the scrollback history, switch between windows, etc. All windows run their programs completely independent of each other. Programs continue to run when their window is currently not visible and even when the whole screen session is detached from the user's terminal. Each virtual terminal provides the functions of the DEC VT100 terminal and, in addition, several control functions from the ANSI X3.64 (ISO 6429) and ISO 2022 standards (e.g., insert/delete line and support for multiple character sets).

When I first found screen, I didn't know many of its features. I knew I could create new terminals by pressing the default CONTROL-a c key sequence and switch between them with CONTROL-a a, and that it would keep my windows alive if I got disconnected, so I could get them back by typing "screen -r" after reconnecting to the computer. Not much else. I clearly wasn't as productive as I could have been, so I set out to explore the whole screen program.

To do that I used my cheat-sheet approach. As I have written before, the approach is to keep a printed cheat sheet in front of me as I learn new commands. This way, each time I look up a command, I also scan over the other commands and remember them better subconsciously.

This cheat sheet summarizes all the default keyboard mappings, the screen commands that execute each mapping, and a description of each mapping.

I originally made this cheat sheet back in 2001 but had lost the file, so I recreated it in LaTeX, made a PDF, and converted it to .txt format.

Actually, I hadn't used LaTeX much and had difficulties getting it right the first time. I searched my document collection and found The Not So Short Introduction to LaTeX book. Really nice reading. I recommend it!

Here is an example screenshot of how I use screen effectively:

screen - terminal emulator - split windows, named tabs, date, load

For more cool screen screenshots, search Google Images for 'gnu screen'.

Here is the cheat sheet itself:

Download Screen Cheat Sheet

PDF format (.pdf):
Download link: screen cheat sheet (.pdf)
Downloaded: 120493 times

ASCII .txt format:
Download link: screen cheat sheet (.txt)
Downloaded: 19595 times

LaTeX format (.tex):
Download link: screen cheat sheet (.tex)
Downloaded: 3855 times

Have fun becoming more efficient with screen!

This article is part of the article series "Creating picurls.com."

The making of picurls.com, the Popurls for Pictures: buzziest pics

This is part one of a two-part series on how I am going to develop picurls.com, a popurls.com-like website for the buzziest pics on the net.

As you may remember, a few weeks ago I created redditmedia.com (now moved to http://reddit.picurls.com) and digpicz.com. These sites were fun to make. I love checking them daily, and people love them as well!

While I was creating the second site, digpicz, it struck me: why not create a single site, similar to popurls, which aggregates links to pictures from many sources?

Let's get to it!

Update! Picurls.com launches:
picurls - picture buzz, buzziest pics on the net

ps. I am terribly sorry for the delay between this post and the previous one. The last year at university has begun for me and I am pushing the limits of how many subjects a person can take: 4 math courses on top of the 5 required physics courses (9 courses total this term). I'm extremely busy and tired in the evenings when I get back home. Future posts might also not be as regular as before...

Technical Design of Picurls

Technically the website is very easy to create. It will use a bunch of Perl programs which periodically check for new posts on various social news and bookmarking sites, insert these posts into a database, and create a small picture thumbnail for each post.

This project can now reuse components from digpicz and redditmedia. Each of them had its own post/story extractor which scraped content from the respective site.

If you looked closely at those scripts, you would have noticed that a lot of the code was almost identical in both of them.

Each site also used the same set of scripts for finding the "best" picture on a page, retrieving it, and creating a thumbnail out of it. These scripts will also get reused.

The first version of the Picurls website is going to scrape content from 9 sources.

I really don't want to duplicate the same code over and over again by writing an extractor program for each of the sites. I'd better write a more generic, plugin-based data scraper which can easily be extended and reused (read more about it below). The existing digg and reddit extractors can then easily be turned into scraper plugins.

The data scraper program will output posts in a human readable format ready for input to a program which inserts the posts into a temporary picture database.

Once the posts are in the database, another program will run and make a few attempts to extract a thumbnail for each post. If it succeeds, it will move the post from the temporary picture table to the real picture table.
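The promote step can be sketched roughly like this. This is a Python sketch rather than the real Perl program, and the table and column names (temp_pictures, pictures, id, url, thumb) are hypothetical, since the actual schema isn't shown here:

```python
import sqlite3

def promote_posts(db_path, make_thumbnail, max_tries=3):
    """Try to create a thumbnail for each post in the temporary table;
    on success, move the post to the real picture table.
    (Hypothetical table/column names, for illustration only.)"""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT id, url FROM temp_pictures").fetchall()
    for post_id, url in rows:
        for _ in range(max_tries):
            thumb = make_thumbnail(url)   # returns None on failure
            if thumb:
                conn.execute(
                    "INSERT INTO pictures (id, url, thumb) VALUES (?, ?, ?)",
                    (post_id, url, thumb))
                conn.execute("DELETE FROM temp_pictures WHERE id = ?",
                             (post_id,))
                break
    conn.commit()
    conn.close()
```

Posts whose thumbnail extraction keeps failing simply stay in the temporary table for a later run.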

The user interface of the site will be done in PHP and will use data from the real picture table. To minimize the load on the server, the website pages will be cached. Designing a primitive caching system in PHP is very simple, but there are solid libraries which already do it for us. One of them is Smarty, a template engine for PHP. More specifically, it facilitates a manageable way to separate application logic and content from its presentation. Not only that, it can easily be instructed to cache the pages.
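The caching idea is simple enough to sketch in a few lines. This is a minimal illustration in Python of the serve-from-cache-until-flushed pattern, not Smarty's actual (PHP) implementation:

```python
import time

class PageCache:
    """Minimal page cache in the spirit of template-engine caching:
    serve a stored copy until it expires or is explicitly flushed."""

    def __init__(self, ttl=3600):
        self.ttl = ttl
        self.pages = {}            # page name -> (rendered html, timestamp)

    def fetch(self, name, render):
        entry = self.pages.get(name)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]        # cache hit: no rendering, no db queries
        html = render()            # cache miss: render the page once
        self.pages[name] = (html, time.time())
        return html

    def flush(self, name):
        # called when a new picture or a new comment is added
        self.pages.pop(name, None)
```

The expensive render function (database queries, templating) runs only on a miss; every other request is a dictionary lookup.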

Digpicz and redditmedia were completely pre-generated by a Perl program, so the web server only had to serve regular HTML files. The sites had less functionality but they handled traffic well. It will be almost the same with picurls: Smarty will serve the cached HTML contents, and only when a new picture or a new comment is added will the cache be flushed. This way I do not have to worry about the server getting too loaded.

Using PHP as the server-side language for the user interface will make the website much more lively than redditmedia or digpicz. I want the new site to have comments, most viewed pics, most commented pics, search functionality, voting, and more features later.

The first version of picurls will have comments; the other features will come later. Also, I want to write it so that the programs can be reused in other projects.

Designing the Generic Plugin-Based Data Scraper

The basic idea of the data scraper is to crawl websites and to extract the posts in a human readable output format. I want it to be easily extensible via plugins and be highly reusable. Also I want the scraper to have basic filtering capabilities to select just the posts which I am interested in.

There are two parts to the scraper - the scraper library and the scraper program which uses the library and makes it easier to scrape many sites at once.

The scraper library consists of the base class 'sites::scraper' and plugins for various websites. For example, Digg's scraper plugin is 'sites::digg' (it inherits from sites::scraper).

The constructor of each plugin takes four optional arguments: pages, vars, patterns and pattern_file.

  • pages - integer, specifies how many pages to scrape in a single run,
  • vars - hashref, specifies parameters for the plugin,
  • patterns - hashref, specifies string regex patterns for filtering posts,
  • pattern_file - string, path to file containing patterns for filtering posts.

Here is a Perl one-liner example of using the scraper library (without the scraper program). This example scrapes two pages of popular stories from Digg's programming section, filtering just the posts matching 'php' (case-insensitively):

perl -Msites::digg -we '$digg = sites::digg->new(pages => 2, patterns => { title => [ q/php/ ], desc => [ q/php/ ] }, vars => { popular => 1, topic => q/programming/ }); $digg->scrape_verbose'

In this example we scrape two pages of popular digg posts (ones that made it to the front page) in the programming category (topic) which have 'php' in either the title or the description.

Here is the output of the plugin:

comments: 27
container_name: Technology
container_short_name: technology
description: With WordPress 2.3 launching this week, a bunch of themes and plugins needed updating. If you're not that familiar with <strong>PHP</strong>, this might present a slight problem. Not to worry, though - we've collected together 20+ tools for you to discover the secrets of <strong>PHP</strong>.
human_time: 2007-09-26 18:18:02
id: 3587383
score: 921
status: popular
title: The <strong>PHP</strong> Toolbox: 20+ <strong>PHP</strong> Resources
topic_name: Programming
topic_short_name: programming
unix_time: 1190819882
url: http://mashable.com/2007/09/26/php-toolbox/
user: ace77
user_icon: http://digg.com/users/ace77/l.png
user_profileviews: 17019
user_registrered: 1162332420
site: digg

comments: 171
container_name: Technology
container_short_name: technology
description: "Back in January 2005, I announced on the O'Reilly blog that I was going to completely scrap over 100,000 lines of messy <strong>PHP</strong> code in my existing CD Baby (cdbaby.com) website, and rewrite the entire thing in Rails, from scratch."Great article.
human_time: 2007-09-23 06:47:38
id: 3548227
score: 1653
status: popular
title: 7 Reasons I Switched Back to <strong>PHP</strong> After 2 Years on Rails
topic_name: Programming
topic_short_name: programming
unix_time: 1190519258
url: http://www.oreillynet.com/ruby/blog/2007/09/7_reasons_i_switched_back_to_p_1.html
user: Steaminx
user_icon: http://digg.com/users/Steaminx/l.png
user_profileviews: 14083
user_registrered: 1104849214
site: digg

Each story is represented as a paragraph of key: value pairs. In this case the scraper found 2 posts matching PHP.

Any program taking this output as input is free to choose which parts of the information it wants to use.

It is guaranteed that each plugin produces output with at least 'title', 'url' and 'site' fields.

The date of the post, if available, is provided in two fields: 'unix_time' and 'human_time'.

In Picurls' case, I am interested in the 'title', 'url', 'unix_time' and 'site' fields.
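A consumer of this output only needs a few lines to split it back into records. Here is a minimal sketch in Python (the real programs are in Perl) of parsing the paragraphs of 'key: value' lines back into dictionaries:

```python
def parse_posts(text):
    """Parse scraper output: paragraphs of 'key: value' lines,
    separated by blank lines, into a list of dicts."""
    posts = []
    for para in text.strip().split("\n\n"):
        post = {}
        for line in para.splitlines():
            key, _, value = line.partition(": ")
            post[key] = value
        posts.append(post)
    return posts
```

Any downstream program (the database inserter, for instance) can then pick out just the fields it cares about.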

To create a plugin, one must override just three methods from the base class:

  • site_name - should return a unique site id, which is output in each post as the 'site' field,
  • get_page_url - given a page number, should construct a URL to the page containing posts,
  • get_posts - given the content of the page located at the last get_page_url call, should return an array of hashrefs of key => value pairs with the post information.
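The base-class/plugin split can be sketched like this. This is a Python analogue of the Perl classes, not the real code, and ExampleSite with its URLs is entirely made up:

```python
class Scraper:
    """Base class, analogous to sites::scraper: drives page fetching
    and delegates site specifics to the three overridable methods."""

    def __init__(self, pages=1):
        self.pages = pages

    # --- the three methods a plugin must override ---
    def site_name(self):
        raise NotImplementedError

    def get_page_url(self, n):
        raise NotImplementedError

    def get_posts(self, html):
        raise NotImplementedError

    def scrape(self, fetch):
        """fetch is a function url -> page content."""
        posts = []
        for n in range(1, self.pages + 1):
            html = fetch(self.get_page_url(n))
            for post in self.get_posts(html):
                post["site"] = self.site_name()   # guaranteed field
                posts.append(post)
        return posts

class ExampleSite(Scraper):
    """A toy plugin (hypothetical site) showing the three overrides."""

    def site_name(self):
        return "example"

    def get_page_url(self, n):
        return "http://example.com/popular?page=%d" % n

    def get_posts(self, html):
        # a real plugin would parse the page markup here
        return [{"title": line, "url": "http://example.com"}
                for line in html.splitlines()]
```

The base class never knows anything about a particular site; all site knowledge lives in the three overridden methods.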

It's difficult to document everything the library does here; it would take a few pages. If you are interested, please take a look at the sources.

Take a look at the digg.com scraper plugin, digg.pm, to see how trivial it is to write a new plugin:
See generic plugin-based scraper (digg.com scraper plugin) (downloaded: 2897 times)

Here is the base class of scraper library scraper.pm:
See generic plugin-based scraper (base class) (downloaded: 1847 times)

The scraper program takes a bunch of command line arguments and calls each plugin in turn, generating a large amount of output.

The program is called scraper.pl. Running it without arguments prints its basic usage:

Usage: ./scraper.pl <site[:M][:{var1=val1; var2=val2 ...}]> ... [/path/to/pattern_file]
Crawls given sites extracting entries matching optional patterns in pattern_file
Optional argument M specifies how many pages to crawl, default 1
Arguments (variables) for plugins can be passed via an optional { }

The arguments in { } get parsed and are then passed to the constructor of the site's plugin. Several sites can be scraped at once.
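The site[:M][:{...}] argument syntax can be parsed with a single regular expression. Here is a sketch in Python that mirrors (but does not reproduce) what scraper.pl does:

```python
import re

def parse_site_arg(arg):
    """Parse a 'site[:M][:{var1=val1; var2=val2}]' command line
    argument into (site, pages, vars)."""
    m = re.match(r"(\w+)(?::(\d+))?(?::\{(.*)\})?$", arg)
    if not m:
        raise ValueError("bad site argument: %s" % arg)
    site, pages, raw_vars = m.group(1), m.group(2), m.group(3)
    plugin_vars = {}
    if raw_vars:
        for pair in raw_vars.split(";"):
            key, _, val = pair.strip().partition("=")
            plugin_vars[key] = val
    return site, int(pages or 1), plugin_vars
```

The resulting (site, pages, vars) triple maps directly onto the plugin constructor arguments described earlier.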

For example, running the program with the following arguments:

./scraper.pl reddit:2:{subreddit=science} stumbleupon:{tag=photography} picurls.txt

Would scrape two pages of science.reddit.com and one page of the StumbleUpon website tagged 'photography', using the filtering rules in the file 'picurls.txt'.

This is how the output of this program looks:

desc: Morning Glory at rest before another eruption, Yellow Stone National Park.
human_time: 2007-02-14 04:34:41
title: public-domain-photos.com/free-stock-photos-4/travel/yellowstone/m...
unix_time: 1171420481
url: http://www.public-domain-photos.com/free-stock-photos-4/travel/yellowstone/morning-glory-pool.jpg
site: stumbleupon

desc: Time for yur Bath
human_time: 2007-10-10 04:34:41
title: woostercollective.com/2007/07/16/giantduck1.jpg
unix_time: 1191980081
url: http://www.woostercollective.com/2007/07/16/giantduck1.jpg
site: stumbleupon

human_time: 2007-10-13 15:34:42
id: 2zq0v
score: 4
title: Sharpest image of Pluto ever taken
unix_time: 1192278882
url: http://www.badastronomy.com/bablog/2007/10/12/sharpest-image-of-pluto-ever-taken/
user: clawoo
site: reddit

Here is the scraper.pl program itself:
See generic plugin-based scraper (scraper program) (downloaded: 1857 times)

Here is what the filter file picurls.txt for Picurls looks like:

# match picture urls
#
url: \.jpg$
url: \.gif$
url: \.png$

# match common patterns describing posts having pictures in them
#
[[(].*picture.*[])]
[[(].*pic.*[])]
[[(].*image.*[])]
[[(].*photo.*[])]
[[(].*comic.*[])]
[[(].*chart.*[])]
[[(].*graph.*[])]

photos? of
pics? of
images? of
pictures? of
comics? of
charts? of
graphs? of
graphics? of
(this|these|those) photos?
(this|these|those) pics?
(this|these|those) images?
photosets? (on|of)

# match domains containing just pics
url: xkcd\.com
url: flickr\.com
url: photobucket\.com
url: imageshack\.us
url: bestpicever\.com

The format of the file is the following:
[url: |title: |desc: ]regex_pattern

The url:, title: and desc: prefixes are optional. They specify whether the entry from a website should be matched against its url, title or description.

If none of url:, title: or desc: is specified, the pattern is matched against both the title and the description.
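These matching rules are easy to sketch. Here is an illustrative Python version (the real filter is Perl) that loads a pattern file and tests a post against it:

```python
import re

def load_patterns(lines):
    """Parse filter-file lines into (field, compiled regex) pairs.
    field is None for the default 'title or description' case."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                        # skip blanks and comments
        m = re.match(r"(url|title|desc): (.*)", line)
        if m:
            patterns.append((m.group(1), re.compile(m.group(2), re.I)))
        else:
            patterns.append((None, re.compile(line, re.I)))
    return patterns

def post_matches(post, patterns):
    """A post is accepted if any pattern matches its designated field."""
    for field, rx in patterns:
        if field is None:
            if rx.search(post.get("title", "")) or \
               rx.search(post.get("desc", "")):
                return True
        elif rx.search(post.get(field, "")):
            return True
    return False
```

A post is kept as soon as any single rule matches, which is also how the examples in picurls.txt are meant to combine.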

Update: I added another format field, 'perl:', which allows writing a filter predicate as an anonymous Perl subroutine that gets called on each scraped item. A predicate is a function which returns true or false for a given input. In our case, if the predicate returns true, the item is accepted (and printed as output later); if false, the next predicate (if any) is considered under the same rule.

Here is an example of a filter file which defines a single predicate that filters out most URLs pointing to the root location of a site (not likely to contain an interesting picture):

# Discard items which point to index pages
#
perl: sub {
    use URI;
    my $post = shift;

    my $uri = URI->new($post->{url});
    my $path = $uri->path;

    if (!length $path) { # empty path
        return 0;
    }
    elsif ($path =~ m!^/+$!) { # just a slash '/'
        return 0;
    }
    elsif ($path =~ m!^/(home|index)\.(php|html|htm|aspx?)$!i) { # some index files
        return 0;
    }

    return 1;
}

And here is the whole scraper package with the scraper library, 9 plugins and the scraper program:

Download Generic Plugin-Based Website Scraper

All the scripts in a single .zip:
Download link: generic plugin-based scraper (whole package)
Downloaded: 484 times

The second part of the article will discuss the database design and the user interface of picurls.com.

Update: Click here to view Part II of the article.

Hosting

ZigZap Technologies have kindly provided me with a server hosting for all my needs. If you are looking for hosting solutions, they are excellent!

Until next time!

peteris krumins interview

Yesterday I was interviewed by Muhammad Saleem. We discussed my programming background, why I created RedditMedia and DigPicz, where I go from here, and whether I would ever sell digg pics to Digg!

Read the whole interview with me.

I copied it here in case Muhammad's website ever goes down:

peter is 22 years old. he is studying physics and loves mathematics but his real passion is computer science. he has recently become famous for developing reddit media and digpicz.

hi peter. thanks for taking out the time to answer some questions.

let's start with some background.

sure. i got into programming around 1996. it was hard to get internet access here in latvia at that time so i befriended a unix sysadmin who was like my teacher at that time. anyway, from there, i met this friend on irc, who helped me with various computer-related things and introduced me to linux and programming.

my first experience with programming was creating an irc client. it was fun exploring the protocol. since then i have been programming both for fun and work in many languages. at one point i even wrote an intrusion prevention system when i worked as a white-hat (ethical) hacker.

so how did you end up creating reddit media?

i have been an active reddit user for a while, since the content there is relatively better than other social sites. when they added the programming subreddit, it became my number one source of information. as for why i created reddit media, it was mostly so that people would have an easy way to access all the media (pictures and videos) that is submitted to reddit.

it was so easy to create it, that i just did it.

and how about diggpicz?

well, as i'm sure you know, digg users have been asking for a pictures section on the site for a long time. i saw that the digg developers kept saying they would create the section but hadn't done it yet, so i decided to do it for digg users. i have had so much programming experience that this is easy for me and allows me to help such a large community with relatively little effort on my part.

it was just a few hours of work to reuse code that i had written for reddit media to create digg picz.

so where do you go from here? it seems that digg's official pictures section is coming. is there any chance that they will seek help from you or would you consider working for/with them?

well, even when digg releases their own version, i won't feel that my efforts are wasted. creating the site was fun and i enjoyed helping thousands of users in the meantime. again, the site took only 7 hours to make. that said, i doubt that they will use my site, because i wrote it entirely in perl and optimized it for a shared hosting server, meaning that all the pages are pre-generated and there is no interaction with the database in real time. and digg runs php, so it would be a complete rewrite. my intention was never 'build-to-flip'.

working for digg would be cool, but there is a problem with that. i am a last-year theoretical physics student and i graduate only in june 2008. i am applying to mit this november and if i get accepted then i will put all my effort into studying there, not working. if i don't get in, then i will definitely consider working for digg.

now that you have created a version of the site for reddit and a version for digg and seen how successful both of them have become, have you thought about releasing a pligg-like cms to let people create their own media sites?

yes i have thought about it, but first i would like to finish picurls.com (same idea as popurls.com but focused on pictures) which will be in perl again. in an upcoming version of picurls.com you will be able to choose from a variety of sources, which ones you want to see pictures from.

hopefully this site will launch by this sunday.

and you are funding all these projects out of your own pocket?

yes, i am funding all of this -literally- out of my own pocket, hence i am looking for someone who would sponsor me a dedicated linux server.

thanks for your time and i hope somebody will generously gift you a linux server.

digpicz: digg’s missing picture section

I am completely amazed by the amount of traffic and email I have received after launching digpicz.com.

Digpicz received an incredible 100'000 visitors during the first 20 hours! I launched it at 3am (my local time), it got submitted to digg.com by Ilker Yoldas from TheThinkingBlog soon after, and it made it to the front page of Digg two hours later.

According to statcounter.com, by midnight it had been visited by 100'727 unique visitors. During the two days it has been online (September 2 and September 3), 130'848 uniques have made 265'836 page requests.

digpicz, digg’s missing picture section, statcounter traffic stats

Since I was running it on a shared hosting server ($9.95/month at Dreamhost), I had foreseen that there might be problems with dynamically generating pages on each request (for example, pulling data from the database and creating HTML output with PHP or Perl), so all the pages on the website were simple pre-generated HTML documents. Once a new picture was added, only index.html (the default page) was regenerated.

Using this technique helped a shared server with 1000 users (wc -l /etc/passwd) survive!

Soon after digpicz.com got posted to Digg, the story was picked up by TechCrunch, Mashable, FranticIndustries (which was first to blog about it :)) and many other sites.

Here is a picture from Google Analytics web statistics displaying top referrals (September 2 data only):

digpicz, digg’s missing picture section, google analytics referral statistics

Things were not as nice for this blog. Since I posted how the site was made and made the full source code of the digpicz.com website generator available, the blog entry got posted to Digg as well.

I use the Wordpress blogging platform for this blog, and it is not really designed with handling slashdot-like effects (the digg effect, for example) in mind. If no optimizations are done, a traffic spike usually just brings the server down. With all the plugins I use on this blog, Wordpress makes around 40 SQL queries for each served page and then runs the whole content through filters and hooks and whatnot.

I did not expect the site to get Dugg that soon and had not made any optimizations to the blog.

The post made it to Digg's front page in 7 hours and the traffic brought the whole shared hosting server to its knees. The blog stopped responding and I had to do something, because I was losing many, many potential new readers and subscribers.

I tried installing the WP-Cache 2 plugin on my local development server with the idea of uploading it via FTP to the shared host, but the plugin didn't work. I have no idea why, and I didn't have time to debug the problem. I had to find another solution.

I had set up my blog to use a permalink link structure, and I remembered that the mod_rewrite directives (ps. here is an excellent mod_rewrite cheat sheet) in the .htaccess file, which set up the link structure, skip rewriting if an existing file or directory is requested.

These two lines in the .htaccess generated by Wordpress saved my blog:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

The story Digg linked to was /blog/designing-digg-picture-website/. I quickly created this directory on the server. Now the second RewriteCond rule kicked in, and I instantly noticed the load drop from 350 to a very low value. Then I created a stripped-down version of the original article and placed it as an index.html file in the newly created directory. The stripped version looked like this. That's it: the load had dropped, the pages were being served, I was happy, and the readers were happy as well. :)

According to Statcounter my blog received 36'389 unique visitors during these two days:

catonmat statcounter traffic during launching digpicz

And the feed subscriber count (at FeedBurner) increased from 104 to 311 (During September 2):

catonmat feedburner subscribers after launching digpicz

I can't wait to see how many blog subscribers I will have once the data for Tuesday is updated! :)

I can tell you a little secret now. You know popurls.com, right? If you don't, then it's a website which aggregates all the best buzz around the net from sources like digg, reddit, del.icio.us, newsvine and many others.

I bought a nicely named domain last week and I am working on a picture aggregation website which finds all the latest posts on all the social websites and displays them on a single page. :)

Expect this website to be out by the end of the week, presumably Sunday. I will probably write a two-part article on how this baby was created, how much time it took me and I will release the full source code as usual.

Until next time! ;)

digpicz: digg's missing picture section

Remember how I launched Reddit Media: intelligent fun online last week (read how it was made)?

I have been getting emails saying that it would be a wise idea to launch a Digg media website. Yeah! Why not?
Since Digg already has a video section, there is not much point in duplicating it. The new site can just be digg for pictures.

Update 2008.07.30: I received this PDF, which said that I was abusing Digg's trademarks! So I closed the site. You may visit http://digg.picurls.com to see how it looked. I also zipped up the contents of the site and you may download the whole site: digpicz-2008-07-30.zip!

I don't want to use the word 'digg' in the domain name because people warned me that the trademark owner could take the domain away from me. I'll just go with a single 'g', as "dig", and "picz" for pictures, to make it shorter. So the domain I bought is digpicz.com.

Update: The site has been launched, visit digpicz.com: digg's missing picture section. Time taken to launch the site: ~7 hours.

visit-digpicz-now

Reusing the Reddit Media Generator Suite

I released full source code of the reddit media website (reddit media website generator suite (.zip)). It can now be totally reused with minor modifications to suit the digg for pictures website.

Only the following modifications need to be made:

  • A new extractor (data miner) has to be written which goes through all the stories on digg and finds the ones with pic/pics/images/etc. words in their titles or descriptions (in the reddit generator suite this was the reddit_extractor.pl program, in the /scripts directory of the .zip file). Digg, as opposed to Reddit, provides a public API to access its stories. I will use this API to go through all the stories, create the initial database of pictures, and then monitor digg's front page. This program will be called digg_extractor.pl.
  • The SQLite database structure has to be changed to include a link to Digg's story, the story's description, and a link to the user's avatar.
  • The generate_feed function in static HTML page generator (page_gen.pl) has to be updated to create a digpicz rss feed.
  • HTML template files in /templates directory (in the .zip file) need to be updated to give the site more digg-like look.

That's it! A few hours of work and we have a digg for pictures website running!

Digpicz Technical Design

Let's create the data miner first. As I mentioned it's called digg_extractor.pl, and it is a Perl script which uses Digg public API.

First, we need to get familiar with the Digg API. Skimming over the Basic API Concepts page, we find just a few important points.

Next, to make our data miner get the stories, let's look at Summary of API Features. It mentions List Stories endpoint which "Fetches a list of stories from Digg." This is exactly what we want!

We are interested only in stories which made it to the front page; the API documentation tells us we should issue a GET /stories/popular request to http://services.digg.com.

I typed the following address in my web browser and got a nice XML response with 10 latest stories:

http://services.digg.com/stories/popular?appkey=http%3A%2F%2Fdigpicz.com

The documentation also lists count and offset arguments, which control the number of stories to retrieve and the offset into the complete story list.

So the general algorithm is clear: start at offset=0, loop until we have gone through all the stories, parse each batch of stories, and extract the ones with pics in them.

We want to use the simplest possible Perl library for parsing XML. There is a great one on CPAN which is perfect for this job: XML::Simple. It provides an XMLin function which, given an XML string, returns a reference to a parsed hash data structure. Easy as 3.141592!
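The fetch-and-page loop might look like this. This is a Python sketch of the idea, not the real digg_extractor.pl, and the XML element and attribute names (story, title, link) are assumptions, since the old Digg API response format isn't reproduced here:

```python
import xml.etree.ElementTree as ET

APPKEY = "http%3A%2F%2Fdigpicz.com"   # appkey from the example request above
ITEMS_PER_REQUEST = 15                # stories per Digg page

def fetch_stories(get, offset=0):
    """Fetch one page of popular stories. `get` is a function
    url -> XML string (injected so it can be tested offline)."""
    url = ("http://services.digg.com/stories/popular"
           "?appkey=%s&count=%d&offset=%d"
           % (APPKEY, ITEMS_PER_REQUEST, offset))
    root = ET.fromstring(get(url))
    return [{"title": story.findtext("title"),
             "url": story.get("link")}        # assumed response shape
            for story in root.findall("story")]

def all_stories(get):
    """Walk the whole story list: advance offset until a short page."""
    offset, out = 0, []
    while True:
        page = fetch_stories(get, offset)
        out.extend(page)
        if len(page) < ITEMS_PER_REQUEST:
            break                              # last page reached
        offset += ITEMS_PER_REQUEST
    return out
```

The real script then filters these stories for picture-related words and prints them as key: value paragraphs.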

This script prints out picture stories which made it to the front page in human readable format. Each story is printed as a paragraph:

title: story title
type: story type
desc: story description
url: story url
digg_url: url to original story on digg
category: digg category of the story
short_category: short digg category name
user: name of the user who posted the story
user_pic: url to user pic
date: date story appeared on digg YYYY-MM-DD HH:MM:SS
<new line>

The script has one constant, ITEMS_PER_REQUEST, which defines how many stories (items) to get per API request. Currently it's set to 15, which is the number of stories per Digg page.

The script takes an optional argument which specifies how many requests to make. On each request, the story offset is advanced by ITEMS_PER_REQUEST. Specifying no argument goes through all the stories that have appeared on Digg.

For example, to print out the picture posts currently on the front page of Digg, we can run:

./digg_extractor.pl 1

Here is a sample of real output of this command:

$ ./digg_extractor.pl 1
title: 13 Dumbest Drivers in the World [PICS]
type: pictures
desc: Think of this like an even funnier Darwin awards, but for dumbass driving (and with images).
url: http://wtfzup.com/2007/09/02/unlucky-13-dumbest-drivers-in-the-world/
digg_url: http://digg.com/offbeat_news/13_Dumbest_Drivers_in_the_World_PICS
category: Offbeat News
short_category: offbeat_news
user: suxmonkey
user_pic: http://digg.com/userimages/s/u/x/suxmonkey/large6009.jpg
date: 2007-09-02 14:00:06

This output is then fed into the db_inserter.pl script, which inserts the data into an SQLite database.

Then page_gen.pl is run, which generates the static HTML contents.
Please refer to the original post about the reddit media website generator for more details.

Summing up, only one new script had to be written, and some minor changes to existing scripts had to be made, to generate the new website.

Here is this new script digg_extractor.pl:
digg extractor (perl script, digg picture website generator)

Click http://digg.picurls.com to visit the site!

Here are all the scripts packed together with basic documentation:

Download Digg's Picture Website Generator Scripts

All the scripts in a single .zip:
Download link: digg picture website generator suite (.zip)
Downloaded: 2823 times

For newcomers, digg is a democratic social news website where users decide its contents.

From their faq:

What is Digg?

Digg is a place for people to discover and share content from anywhere on the web. From the biggest online destinations to the most obscure blog, Digg surfaces the best stuff as voted on by our users. You won’t find editors at Digg — we’re here to provide a place where people can collectively determine the value of content and we’re changing the way people consume information online.

How do we do this? Everything on Digg — from news to videos to images to Podcasts — is submitted by our community (that would be you). Once something is submitted, other people see it and Digg what they like best. If your submission rocks and receives enough Diggs, it is promoted to the front page for the millions of our visitors to see.