peteris krumins interview

Yesterday I was interviewed by Muhammad Saleem. We discussed my programming background, why I created Reddit Media and Digpicz, where I go from here, and whether I would ever sell Digpicz to Digg!

Read the whole interview with me.

I copied it here in case Muhammad's website ever goes down:

peter is 22 years old. he is studying physics and loves mathematics but his real passion is computer science. he has recently become famous for developing reddit media and digpicz.

hi peter. thanks for taking out the time to answer some questions.

let's start with some background.

sure. i got into programming around 1996. it was hard to get internet access here in latvia at that time, so i befriended a unix sysadmin on irc who was like a teacher to me. he helped me with various computer-related things and introduced me to linux and programming.

my first experience with programming was creating an irc client. it was fun exploring the protocol. since then i have been programming, both for fun and for work, in many different languages. at one point i even wrote an intrusion prevention system when i worked as a white-hat (ethical) hacker.

so how did you end up creating reddit media?

i have been an active reddit user for a while, since the content there is better than on other social sites. when they added the programming subreddit, it became my number one source of information. as for why i created reddit media, it was mostly so that people would have an easy way to access all the media (pictures and videos) that is submitted to reddit.

it was so easy to create it, that i just did it.

and how about diggpicz?

well, as i'm sure you know, digg users have been asking for a pictures section on the site for a long time. i saw that the digg developers kept saying that they would create the section but hadn't done it yet, so i decided to do it for digg users. i have so much programming experience that this was easy for me and allowed me to help a large community with relatively little effort on my part.

it was just a few hours of work to reuse code that i had written for reddit media to create digg picz.

so where do you go from here? it seems that digg's official pictures section is coming. is there any chance that they will seek help from you or would you consider working for/with them?

well, even when digg releases their own version, i won't feel that my efforts were wasted. creating the site was fun and i enjoyed helping thousands of users in the meantime. again, the site took only 7 hours to make. that said, i doubt that they will use my site, because i wrote it entirely in perl and optimized it for a shared hosting server, meaning that all the pages are pre-generated and there is no interaction with the database in real time. and digg runs php, so it would be a complete rewrite. my intention was never 'build-to-flip'.

working for digg would be cool, but there is a problem with that. i am a final-year theoretical physics student and i graduate only in june 2008. i am applying to mit this november, and if i get accepted i will put all my effort into studying there, not working. if i don't get in, then i will definitely consider working for digg.

now that you have created a version of the site for reddit and a version for digg and seen how successful both of them have become, have you thought about releasing a pligg-like cms to let people create their own media sites?

yes, i have thought about it, but first i would like to finish picurls.com (same idea as popurls.com but focused on pictures), which will be in perl again. in an upcoming version of picurls.com you will be able to choose which sources you want to see pictures from.

hopefully this site will launch by this sunday.

and you are funding all these projects out of your own pocket?

yes, i am funding all of this, literally, out of my own pocket, which is why i am looking for someone to sponsor a dedicated linux server for me.

thanks for your time and i hope somebody will generously gift you a linux server.

digpicz: digg's missing picture section

I am completely amazed by the amount of traffic and emails I have received after launching digpicz.com.

Digpicz received an incredible 100'000 visitors during the first 20 hours! I launched it at 3am (my local time), it got submitted to digg.com by Ilker Yoldas from TheThinkingBlog soon after, and it made it to the front page of Digg two hours later.

According to statcounter.com, by midnight it had been visited by 100'727 unique visitors. During the two days it has been online (September 2 and September 3), 130'848 uniques have made 265'836 page requests.

digpicz, digg’s missing picture section, statcounter traffic stats

Since I was running it on a shared hosting server ($9.95/month at Dreamhost), I had foreseen that there might be problems with dynamically generating pages on each request (for example, pulling data from the database and creating HTML output with PHP or Perl), so all the pages on the website were simple pre-generated HTML documents. Once a new picture was added, only index.html (the default page) was regenerated.

Using this technique helped a shared server with 1000 users (wc -l /etc/passwd) survive!
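A minimal sketch of what that regeneration step could look like in Perl (this is not the actual page_gen.pl; the temp-file-and-rename step is my own addition, shown as one safe way to swap the page in place):

#!/usr/bin/perl
# Sketch only: render the front page once, write it to a temporary
# file, then rename it over index.html so Apache always serves a
# complete, static document and never a half-written one.
use strict;
use warnings;
use File::Copy qw(move);

sub regenerate_index {
    my ($html) = @_;    # the fully rendered page as a string
    my $tmp = "index.html.tmp.$$";
    open my $fh, '>', $tmp or die "can't write $tmp: $!";
    print {$fh} $html;
    close $fh;
    move($tmp, 'index.html') or die "can't replace index.html: $!";
}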

Soon after digpicz.com got posted to Digg, the story was picked up by TechCrunch, Mashable, FranticIndustries (which was first to blog about it :)) and many other sites.

Here is a picture from Google Analytics web statistics displaying top referrals (September 2 data only):

digpicz, digg’s missing picture section, google analytics referral statistics

Things did not go as smoothly for this blog. Since I had posted how the site was made, and made the full source code of the digpicz.com website generator available, that blog entry got posted to Digg as well.

I use the WordPress blogging platform for this blog, and it is not really designed with handling Slashdot-like effects (the Digg effect, for example) in mind. If no optimizations are done, a traffic spike usually just brings the server down. With all the plugins I use on this blog, WordPress makes around 40 SQL queries for each served page and then runs the whole content through filters, hooks and whatnot.

I did not expect the post to get dugg that soon and had not made any optimizations to the blog.

The post made it to Digg's front page in 7 hours and the traffic brought the whole shared hosting server to its knees. The blog stopped responding and I had to do something, because I was losing many, many potential new readers and subscribers.

I tried installing the WP-Cache 2 plugin on my local development server, with the idea of uploading it via FTP to the shared hosting, but the plugin didn't want to work. I have no idea why, and I didn't have time to debug the problem. I had to find another solution.

I had set up my blog to use a permalink structure, and I remembered that the mod_rewrite directives in the .htaccess file (by the way, here is an excellent mod_rewrite cheat sheet), which set up that link structure, avoid rewriting requests when an existing file or directory is requested.

These two lines in the .htaccess generated by WordPress saved my blog:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

The story Digg linked to was /blog/designing-digg-picture-website/. I quickly created this directory on the server. Now the second RewriteCond rule kicked in and I instantly noticed the load drop from 350 to a very low value. Then I created a stripped-down version of the original article and placed it as an index.html file in the newly created directory. The stripped version looked like this. That's it: the load had dropped, the pages were being served, I was happy and readers were happy as well. :)
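For context, the full WordPress-generated block in .htaccess looks roughly like this (the stock block; the /blog/ paths are an assumption based on where this blog lives). The two conditions skip the rewrite to index.php whenever the requested path exists as a real file or directory on disk, which is exactly what the directory trick above exploits:

# BEGIN WordPress (standard block; paths assume the blog lives under /blog/)
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /blog/
# the two lines that saved the day: skip the rewrite when the
# requested path exists as a real file or directory
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# everything else is handed to WordPress
RewriteRule . /blog/index.php [L]
</IfModule>
# END WordPress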

According to Statcounter my blog received 36'389 unique visitors during these two days:

catonmat statcounter traffic during launching digpicz

And the feed subscriber count (at FeedBurner) increased from 104 to 311 (during September 2):

catonmat feedburner subscribers after launching digpicz

I can't wait to see how many blog subscribers I will have once the data for Tuesday is updated! :)

I can tell you a little secret now. You know popurls.com, right? If you don't, then it's a website which aggregates all the best buzz around the net from sources like digg, reddit, del.icio.us, newsvine and many others.

I bought a nicely named domain last week and I am working on a picture aggregation website which finds all the latest posts on all the social websites and displays them on a single page. :)

Expect this website to be out by the end of the week, presumably Sunday. I will probably write a two-part article on how this baby was created, how much time it took me and I will release the full source code as usual.

Until next time! ;)

digpicz: digg's missing picture section

Remember how I launched Reddit Media: intelligent fun online last week (read how it was made)?

I have been getting emails that it would be a wise idea to launch a Digg media website. Yeah! Why not?
Since Digg already has a video section there is not much point in duplicating it. The new site could just be digg for pictures.

Update 2008.07.30: I received this PDF, which said that I was abusing Digg's trademarks! So I closed the site. You may visit http://digg.picurls.com to see what it looked like. I also zipped up the contents of the site; you can download the whole site as digpicz-2008-07-30.zip!

I don't want to use the word 'digg' in the domain name because people warned me that the trademark owner could take the domain away from me. I'll just go with a single 'g', as in "dig", and shorten pictures to "picz". So the domain I bought is digpicz.com.

Update: The site has been launched, visit digpicz.com: digg's missing picture section. Time taken to launch the site: ~7 hours.


Reusing the Reddit Media Generator Suite

I released the full source code of the reddit media website (reddit media website generator suite (.zip)). It can be reused with only minor modifications to suit the digg-for-pictures website.

Only the following modifications need to be made:

  • A new extractor (data miner) has to be written which goes through all the stories on digg and finds the ones with pic/pics/images/etc. in their titles or descriptions (in the reddit generator suite this was the reddit_extractor.pl program, in the /scripts directory of the .zip file). Digg, as opposed to Reddit, provides a public API to access its stories. I will use this API to go through all the stories, create the initial database of pictures, and then keep monitoring digg's front page. This program will be called digg_extractor.pl.
  • The SQLite database structure has to be changed to include a link to the Digg story, the story's description, and a link to the user's avatar.
  • The generate_feed function in the static HTML page generator (page_gen.pl) has to be updated to create a digpicz RSS feed.
  • The HTML template files in the /templates directory (in the .zip file) need to be updated to give the site a more digg-like look.

That's it! A few hours of work and we have a digg for pictures website running!

Digpicz Technical Design

Let's create the data miner first. As I mentioned, it's called digg_extractor.pl, and it is a Perl script which uses Digg's public API.

First, we need to get familiar with the Digg API. Skimming over the Basic API Concepts page, we find just a few important points; the main one for us is that every request must include an appkey parameter identifying the application (mine is http://digpicz.com, which you can see URL-encoded in the request below).

Next, to make our data miner get the stories, let's look at the Summary of API Features. It mentions the List Stories endpoint, which "fetches a list of stories from Digg." This is exactly what we want!

We are interested only in stories which made it to the front page; the API documentation tells us to issue a GET /stories/popular request to http://services.digg.com.

I typed the following address in my web browser and got a nice XML response with 10 latest stories:

http://services.digg.com/stories/popular?appkey=http%3A%2F%2Fdigpicz.com

The documentation also lists count and offset arguments, which control the number of stories to retrieve and the offset into the complete story list.

So the general algorithm is clear: start at offset=0, loop until we have gone through all the stories, parse each batch of stories, and extract the ones with pics in them.

We want to use the simplest Perl library possible to parse the XML. There is a great one on CPAN which is perfect for this job: XML::Simple. It provides an XMLin function which, given an XML string, returns a reference to a parsed hash data structure. Easy as 3.141592!

This script prints out picture stories which made it to the front page in a human-readable format. Each story is printed as a paragraph:

title: story title
type: story type
desc: story description
url: story url
digg_url: url to original story on digg
category: digg category of the story
short_category: short digg category name
user: name of the user who posted the story
user_pic: url to user pic
date: date story appeared on digg YYYY-MM-DD HH:MM:SS
<new line>

The script has one constant, ITEMS_PER_REQUEST, which defines how many stories (items) to get per API request. Currently it's set to 15, which is the number of stories per Digg page.

The script takes an optional argument which specifies how many requests to make. On each request, the story offset is advanced by ITEMS_PER_REQUEST. With no argument, the script goes through all the stories which have appeared on Digg.
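Before looking at the real script's output, here is a minimal sketch of that fetch-and-parse loop (this is not the full digg_extractor.pl; the endpoint, appkey and count/offset arguments come from the API documentation discussed above, while the hash keys story, title and description are my assumptions about how XML::Simple parses Digg's response):

#!/usr/bin/perl
# Sketch of the fetching loop: page through /stories/popular in
# ITEMS_PER_REQUEST-sized chunks and print the titles of stories
# that look like picture posts.
use strict;
use warnings;

use LWP::Simple qw(get);
use XML::Simple qw(XMLin);

use constant ITEMS_PER_REQUEST => 15;

my $appkey = 'http%3A%2F%2Fdigpicz.com';   # URL-encoded application key
my $offset = 0;

while (1) {
    my $url = 'http://services.digg.com/stories/popular'
            . "?appkey=$appkey&count=" . ITEMS_PER_REQUEST
            . "&offset=$offset";
    my $xml = get($url) or last;

    # ForceArray makes sure {story} is an array ref even for one story
    my $data    = XMLin($xml, ForceArray => ['story']);
    my $stories = $data->{story} || [];
    last unless @$stories;

    for my $story (@$stories) {
        my $title = $story->{title}       || '';
        my $desc  = $story->{description} || '';
        next unless "$title $desc" =~ /\b(?:pic(?:ture)?s?|images?|photos?)\b/i;
        print "title: $title\n";
    }

    $offset += ITEMS_PER_REQUEST;
}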

For example, to print out the picture posts currently on the front page of Digg, we could use this command:

./digg_extractor.pl 1

Here is a sample of the real output of this command:

$ ./digg_extractor.pl 1
title: 13 Dumbest Drivers in the World [PICS]
type: pictures
desc: Think of this like an even funnier Darwin awards, but for dumbass driving (and with images).
url: http://wtfzup.com/2007/09/02/unlucky-13-dumbest-drivers-in-the-world/
digg_url: http://digg.com/offbeat_news/13_Dumbest_Drivers_in_the_World_PICS
category: Offbeat News
short_category: offbeat_news
user: suxmonkey
user_pic: http://digg.com/userimages/s/u/x/suxmonkey/large6009.jpg
date: 2007-09-02 14:00:06

This output is then fed into the db_inserter.pl script, which inserts the data into the SQLite database.

Then page_gen.pl is run, which generates the static HTML contents.
Please refer to the original post on the reddit media website generator for more details.
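For completeness, running the whole pipeline looks something like this (the exact invocation is illustrative and may differ slightly from the scripts in the .zip):

./digg_extractor.pl 1 | ./db_inserter.pl
./page_gen.pl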

Summing it up, only one new script had to be written and some minor changes to existing scripts had to be made to generate the new website.

Here is this new script digg_extractor.pl:
digg extractor (perl script, digg picture website generator)

Click http://digg.picurls.com to visit the site!

Here are all the scripts packed together with basic documentation:

Download Digg's Picture Website Generator Scripts

All the scripts in a single .zip:
Download link: digg picture website generator suite (.zip)
Downloaded: 2799 times

For newcomers, digg is a democratic social news website where users decide its contents.

From their faq:

What is Digg?

Digg is a place for people to discover and share content from anywhere on the web. From the biggest online destinations to the most obscure blog, Digg surfaces the best stuff as voted on by our users. You won’t find editors at Digg — we’re here to provide a place where people can collectively determine the value of content and we’re changing the way people consume information online.

How do we do this? Everything on Digg — from news to videos to images to Podcasts — is submitted by our community (that would be you). Once something is submitted, other people see it and Digg what they like best. If your submission rocks and receives enough Diggs, it is promoted to the front page for the millions of our visitors to see.

john resig: jquery library design

One of my upcoming web projects uses AJAX and jQuery. I had watched a dozen video lectures on JavaScript before and thought that I had seen them all. Today I came across John Resig's website and found that he had just been at Google and given a video lecture on Best Practices in JavaScript Library Design.

John Resig is a JavaScript Evangelist, working for the Mozilla Corporation, and the author of the book 'Pro JavaScript Techniques.' He's also the creator and lead developer of the jQuery JavaScript library and the co-designer of the FUEL JavaScript library (included in Firefox 3).

This talk explores all the techniques used to build a robust, reusable, cross-platform JavaScript Library. We'll look at how to write a solid JavaScript API, show you how to use functional programming to create contained, concise, code, and delve deep into common cross browser issues that you'll have to solve in order to have a successful library.

Here is the video:

And here are the slides:

Things that caught my attention in the video:

  • (01:49) jQuery was released in January 2006 and its main focus is on DOM traversal.
  • (05:39) Similar objects should have the same method and property names so that there is a minimal learning curve.
  • (07:16) Fear adding methods to the API. You should keep the API as small as possible.
  • (11:38) jQuery 1.1 reduced the library's size by 47% by removing unnecessary methods, breaking compatibility with jQuery 1.0. A plugin for 1.1 was released as a separate package to provide the old 1.0 interface.
  • (14:10) Look for common patterns in the API and reduce it to its core.
  • (15:33) Be consistent within your API, stick to a naming scheme and argument positioning.
  • (17:11) Evolution of a JavaScript coder:
    • Everything is a reference!
    • You can do OO code!
    • Huh, so that's how Object Prototypes work!
    • Thank God for closures!
  • (21:20) In JavaScript 1.7 there is a let statement which declares variables local to a block.
  • (22:43) If you wrap your entire library in (function() { ... library code ... })() the code will never interfere with other code on the page (a minimal sketch follows after this list).
  • (24:10) Some namespacing questions:
    • (24:17) Can my code coexist with other random code on the site?
    • (24:50) Can my code coexist with other copies of my own library?
    • (25:16) Can my code be embedded inside another namespace?
  • (25:42) Never extend native objects, ever!
  • (26:39) A JavaScript library should: first, work cross browser; second, have functionality.
  • (32:09) You can tweak your object constructor to make Constructor() work the same as new Constructor() (also shown in the sketch after this list).
  • (34:37) There are three ways to extend jQuery, you can add methods, selectors and animations.
  • (36:37) Message passing from one component to another is best done via custom events.
  • (37:45) Quirksmode is a fantastic resource which explains where specific bugs exist in the browsers.
  • (38:53) DOM Events and DOM Traversal problems are solved in depth; many others, such as getting an attribute or getting the computed style, still require hard work.
  • (41:09) In Safari 2, getComputedStyle is null if it's called on an element with display: none or on an element within an element with display: none. Safari 3 implemented the interface, but it just returns undefined.
  • (45:08) Use structured format for documentation. What's nice about it is that it can be converted to other formats and given to users.
  • (48:16) Let your users help you by putting out documentation in a Wiki.
  • (49:36) Don't trust any library that doesn't have a test suite.
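To make two of these points concrete, here is a tiny sketch of the wrapping trick from 22:43 and the constructor trick from 32:09 (the mylib name and its method are made up for illustration; this is not jQuery's actual code):

// Sketch only: 'mylib' is a made-up library name.
(function () {
    // Everything declared in here stays out of the global namespace...
    function mylib(selector) {
        // ...and the constructor works both as mylib(x) and new mylib(x).
        if (!(this instanceof mylib)) return new mylib(selector);
        this.selector = selector;
    }
    mylib.prototype.describe = function () {
        return 'selected ' + this.selector;
    };

    // ...except for the single name we expose on purpose.
    window.mylib = mylib;
})();

// Both forms create a proper mylib instance:
//   mylib('#content').describe();
//   new mylib('#content').describe();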

Here is the Q&A:

  • (52:40) Is jQuery more or less targeted at Firefox, and would it actually be unreasonable to use it, say, on a cellphone?
  • (53:29) How do you filter noise in the community?
  • (54:30) Is jQuery going to get multiple build sets?
  • (55:05) Will there ever be time when library development like this is not necessary anymore? Or do you think that the ecosystem of libraries is good for advancing state of the art?
  • (56:34) How do you compare jQuery to something like the Google Web Toolkit?
  • (57:12) What's the largest project jQuery is used in, in terms of size, development team and code base?

Have fun learning better JavaScript!

When I was developing the reddit media: intelligent fun online website, I needed to embed reddit's up/down voting buttons to allow users to cast votes on media links directly from the site.

reddit up down vote box redditmedia

I remembered that reddit had decided not to display the score of posts submitted less than two hours ago.

reddit post less than two hours ago

This left me thinking: if the scores are not displayed for new posts, what happens to the vote boxes on a just-posted article's page? It wouldn't make much sense to vote if the score wasn't available there. I quickly found a link on reddit's new page which seemed to have received a few votes and added reddit's button for it to an empty HTML document.

A reddit voting button/widget can be embedded on a site by putting the following JavaScript code fragment anywhere in the HTML source:

<script>reddit_url='[URL]'</script>
<script>reddit_title='[TITLE]'</script>
<script language="javascript" src="http://reddit.com/button.js?t=2"></script>

where the URL is the URL to the article and TITLE is the title of the article.

Voila! I now knew something nobody else did: how many votes the post had received!

found score of reddit post before general public

Now, let's create something cool for the general public to use, so that anyone can reveal the scores of all recently posted links. :)

Let's use the Firefox browser and its excellent Greasemonkey add-on/extension.

What is GreaseMonkey you might wonder?

Greasemonkey is a Firefox extension that allows you to write scripts that alter the web pages you visit. You can use it to make a web site more readable or more usable. You can fix rendering bugs that the site owner can't be bothered to fix themselves. You can alter pages so they work better with assistive technologies that speak a web page out loud or convert it to Braille. You can even automatically retrieve data from other sites to make two sites more interconnected.

Greasemonkey by itself does none of these things. In fact, after you install it, you won't notice any change at all... until you start installing what are called "user scripts". A user script is just a chunk of Javascript code, with some additional information that tells Greasemonkey where and when it should be run. Each user script can target a specific page, a specific site, or a group of sites. A user script can do anything you can do in Javascript. In fact, it can do even more than that, because Greasemonkey provides special functions that are only available to user scripts.

Do you see where I am aiming? I will write a "user script" in JavaScript (which I have just learned in more detail) that finds the "just posted" links on a reddit page and replaces the original up/down vote box with the vote box widget, which reveals the current vote count!

There is a great free book available on GreaseMonkey which explains it through code examples. It's called "Dive into GreaseMonkey." It's only 99 pages long and can be read in an hour if you know JavaScript already!

Writing the User Script

The basic idea of the script is to parse the DOM of reddit's page, extract all the posted links, find the ones which do not have a score displayed (i.e., those newer than 2 hours), and then replace the HTML of the original up/down vote box with the widget's HTML.

First we need to understand how reddit's entries are laid out on the page. To do this we could view the HTML source of the page, but that requires too much effort because we'd have to parse HTML in our heads. Let's use something more visual. There is a Firefox extension called Firebug which lets you explore the HTML of a page in a much nicer manner.

reddit firebug entry html

We see that each entry on the page is wrapped in two <tr> elements, each with a class name of "oddRow" or "evenRow". Our Greasemonkey user script will have to find these rows and extract the title information from the first row and the date information from the second row.
To do this, we use the DOM's getElementsByTagName function to retrieve all <tr> elements on the page. Next, we loop over these elements, matching those with a class name of "oddRow" or "evenRow", keep track of whether we are on the first or the second row of an entry, and call the appropriate extraction function for each row.
Look at the find_entries function in the final script to see how it extracts all the entries from the page.

Once we have extracted the entries, all we have to do is replace the HTML of the original up/down vote box with the HTML of the up/down vote widget. (See the display_votes function in the final script.)
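Here is a rough sketch of that whole loop inlined (this is not the downloadable script; the oddRow/evenRow class names come from the markup above, but the cell positions, the "less than two hours" test and the button iframe URL are assumptions made for illustration):

// ==UserScript==
// @name        reddit score revealer (sketch)
// @include     http://reddit.com/new*
// ==/UserScript==

(function () {
    var rows = document.getElementsByTagName('tr');
    for (var i = 0; i < rows.length; i++) {
        if (rows[i].className != 'oddRow' && rows[i].className != 'evenRow')
            continue;

        var titleRow = rows[i];     // first row: vote box + title link
        var dateRow  = rows[++i];   // second row: "submitted N minutes ago"
        if (!dateRow) break;

        // Only touch entries whose score is still hidden, i.e. those
        // submitted less than two hours ago (crude text test).
        if (!/minute|1 hour/.test(dateRow.textContent)) continue;

        var link = titleRow.getElementsByTagName('a')[0];
        var cell = titleRow.getElementsByTagName('td')[0]; // assumed: the vote box cell
        if (!link || !cell) continue;

        // Swap the vote box for reddit's embeddable button, which still
        // reports the score (assumed endpoint behind button.js?t=2).
        cell.innerHTML = '<iframe src="http://reddit.com/button_content?t=2' +
            '&url=' + encodeURIComponent(link.href) +
            '&title=' + encodeURIComponent(link.textContent) +
            '" width="51" height="69" scrolling="no" frameborder="0"></iframe>';
    }
})();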

And we are done!

Here is how the reddit page looks after the Greasemonkey script has run:

greasemonkey up down vote box reddit page

Reddit Score Revealer GreaseMonkey Script

Download link: reddit score revealer (greasemonkey script)
Downloaded: 2378 times

Notes: this script works only with the wonderful Firefox browser. To run it you will also need the Greasemonkey extension.

Once you click the link, FireFox will automatically ask you if you want to install this script. Select "Install" and visit reddit.com/new to have infinite power over the regular users!

Here is a screenshot of the Greasemonkey user script installation dialog:

greaseMonkey user script installation dialog

ps. Has anyone successfully debugged Greasemonkey scripts? I could not find a way to set breakpoints or even load the user script in any of the debuggers that come with Firefox. Any suggestions?

pps. The current implementation of the script replaces the original up/down vote box with an iframe. This is kind of ugly. I'll leave it as an exercise for a curious reader to change the script to retrieve the score via the XMLHttpRequest interface and change the "published NN minutes/hours ago" status line to one with the score in it. :)