One of my upcoming web projects uses AJAX and jQuery. I had watched a dozen video lectures on JavaScript before and thought that I had seen them all. Today I came across John Resig's website and found that he had just been at Google, where he gave a video lecture on Best Practices in JavaScript Library Design.

John Resig is a JavaScript Evangelist, working for the Mozilla Corporation, and the author of the book 'Pro JavaScript Techniques.' He's also the creator and lead developer of the jQuery JavaScript library and the co-designer of the FUEL JavaScript library (included in Firefox 3).

This talk explores all the techniques used to build a robust, reusable, cross-platform JavaScript Library. We'll look at how to write a solid JavaScript API, show you how to use functional programming to create contained, concise, code, and delve deep into common cross browser issues that you'll have to solve in order to have a successful library.

Here is the video:

And here are the slides:

Things that caught my attention in video:

  • (01:49) jQuery was released in January 2006 and its main focus is on DOM traversal.
  • (05:39) Similar objects should have the same method and property names so that the learning curve is minimal.
  • (07:16) Fear adding methods to the API. You should keep the API as small as possible.
  • (11:38) jQuery 1.1 reduced the library's size by 47% by removing unnecessary methods, which broke compatibility with jQuery 1.0. A plugin for 1.1 was released as a separate package to provide the old 1.0 interface.
  • (14:10) Look for common patterns in the API and reduce it to its core.
  • (15:33) Be consistent within your API: stick to a naming scheme and argument positioning.
  • (17:11) Evolution of a JavaScript coder:
    • Everything is a reference!
    • You can do OO code!
    • Huh, so that's how Object Prototypes work!
    • Thank God for closures!
  • (21:20) In JavaScript 1.7 there is a let statement which declares variables local to a block.
  • (22:43) If you wrap your entire library in (function() { ... library code ... })(), its code will never interfere with other libraries' code.
  • (24:10) Some namespacing questions:
    • (24:17) Can my code coexist with other random code on the site?
    • (24:50) Can my code coexist with other copies of my own library?
    • (25:16) Can my code be embedded inside another namespace?
  • (25:42) Never extend native objects, ever!
  • (26:39) A JavaScript library should: first, work cross browser; second, have functionality.
  • (32:09) You can tweak your Object constructor to make Constructor() work the same as new Constructor().
  • (34:37) There are three ways to extend jQuery: you can add methods, selectors and animations.
  • (36:37) Message passing from one component to another is best done via custom events.
  • (37:45) Quirksmode is a fantastic resource which explains where specific bugs exist in the browsers.
  • (38:53) DOM event and DOM traversal problems have been solved in depth; many others, such as getting an attribute or getting the computed style, still require hard work.
  • (41:09) In Safari 2, the result of getComputedStyle is null if it's called on an element with display: none or on an element nested inside an element with display: none. Safari 3 implemented the interface, but it just returns undefined.
  • (45:08) Use a structured format for documentation. What's nice about it is that it can be converted to other formats and given to users.
  • (48:16) Let your users help you by putting out documentation in a Wiki.
  • (49:36) Don't trust any library that doesn't have a test suite.
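
Two of the points above, the closure wrapper at 22:43 and the constructor that works without new at 32:09, can be sketched together. This is only an illustration and is not taken from the talk or from jQuery's source; the names mylib and Point are made up for the example.

```javascript
// Hypothetical mini-library; `mylib` and `Point` are invented for illustration.
var mylib = (function () {
  // Private to the closure: this variable never leaks into the global scope.
  var instanceCount = 0;

  function Point(x, y) {
    // Called without `new`? Then `this` is not a Point, so re-invoke properly.
    if (!(this instanceof Point)) {
      return new Point(x, y);
    }
    this.x = x;
    this.y = y;
    instanceCount += 1;
  }

  Point.prototype.toString = function () {
    return "(" + this.x + ", " + this.y + ")";
  };

  // Expose only what callers need; everything else stays hidden.
  return {
    Point: Point,
    count: function () { return instanceCount; }
  };
})();

var a = new mylib.Point(1, 2);
var b = mylib.Point(3, 4); // no `new`, still a proper Point
```

Only mylib is added to the surrounding scope; instanceCount is reachable solely through mylib.count().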

Here is the Q&A:

  • (52:40) Is jQuery more or less targeted at Firefox, so that it wouldn't actually be reasonable to use on, say, a cellphone?
  • (53:29) How do you filter noise in the community?
  • (54:30) Is jQuery going to get multiple build sets?
  • (55:05) Will there ever be a time when library development like this is not necessary anymore? Or do you think that the ecosystem of libraries is good for advancing the state of the art?
  • (56:34) How do you compare jQuery to something like Google Web Toolkit?
  • (57:12) What's the largest project jQuery is used in, in terms of size, development team, and code base?

Have fun learning better JavaScript!

When I was developing the reddit media: intelligent fun online website, I needed to embed reddit's up/down voting buttons to allow users to cast votes on media links directly from the site.

[Screenshot: reddit up/down vote box on reddit media]

I remembered that reddit had decided not to display scores for posts submitted less than two hours ago.

[Screenshot: reddit post submitted less than two hours ago]

This left me thinking: if the scores are not displayed for new posts, what's the point of having vote boxes on the page of a just-posted article? I thought it wouldn't make sense if the score wasn't available there. I quickly found a link on reddit's new page that seemed to have received a few votes and added a reddit button for it to an empty HTML document.

A reddit voting button/widget can be embedded on a site by putting the following JavaScript code fragment anywhere in the HTML source:

<script>reddit_url='[URL]'</script>
<script>reddit_title='[TITLE]'</script>
<script language="javascript" src="http://reddit.com/button.js?t=2"></script>

where the URL is the URL to the article and TITLE is the title of the article.

Voila! I now knew something nobody else did: how many votes the post had received!

[Screenshot: score of a reddit post found before the general public could see it]

NOW, let's create something cool for the general public to use, so that anyone can reveal the scores of all recently posted links :)

Let's use the Firefox browser and its excellent Greasemonkey add-on/extension.

What is Greasemonkey, you might wonder?

Greasemonkey is a Firefox extension that allows you to write scripts that alter the web pages you visit. You can use it to make a web site more readable or more usable. You can fix rendering bugs that the site owner can't be bothered to fix themselves. You can alter pages so they work better with assistive technologies that speak a web page out loud or convert it to Braille. You can even automatically retrieve data from other sites to make two sites more interconnected.

Greasemonkey by itself does none of these things. In fact, after you install it, you won't notice any change at all... until you start installing what are called "user scripts". A user script is just a chunk of Javascript code, with some additional information that tells Greasemonkey where and when it should be run. Each user script can target a specific page, a specific site, or a group of sites. A user script can do anything you can do in Javascript. In fact, it can do even more than that, because Greasemonkey provides special functions that are only available to user scripts.

Do you see where I am going with this? I will write a "user script" in JavaScript, which I have just learned in more detail, to find the "just posted" links on a reddit page and replace the original up/down vote box with the vote box widget, which reveals the current vote count!

There is a great free book on Greasemonkey which explains it through code examples. It's called "Dive Into Greasemonkey." It's only 99 pages long and can be read in an hour if you know JavaScript already!

Writing the User Script

The basic idea of the script is to parse the DOM of reddit's page, extract all the posted links, find the ones which do not have a score displayed (those newer than two hours), and then replace the HTML of the original up/down vote box with the widget's HTML.

First we need to understand how reddit's entries are laid out on the page. To do this we could view the HTML source of the page, but that requires too much effort because we'd have to parse HTML in our heads. Let's use something more visual. There is a Firefox extension called Firebug which lets you explore the HTML of a page in a much nicer manner.

[Screenshot: HTML of a reddit entry as shown in Firebug]

We see that each entry on the page is wrapped in two <tr> elements, each of which has the class name "oddRow" or "evenRow". Our Greasemonkey user script will have to find these rows and extract the title information from the first row and the date information from the second row.
To do this we use the DOM's getElementsByTagName function to retrieve all <tr> elements on the page. Next we loop over these elements, matching those with the class name "oddRow" or "evenRow", keep track of whether we are on the first or the second row of an entry, and call the corresponding extraction function for each row.
Look at the find_entries function in the final script to see how it extracts all the entries from the page.
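
The row-pairing loop described above might look roughly like the sketch below. It is not the actual find_entries from the script; the function name pairEntryRows and the titleRow/infoRow field names are invented here, and the function takes any array-like list of objects with a className property so that it can be tried outside a browser.

```javascript
// Sketch only: the real find_entries in the user script may differ.
// `rows` is anything array-like whose items have a className property,
// for example document.getElementsByTagName("tr") inside the browser.
function pairEntryRows(rows) {
  var entries = [];
  var firstRow = null; // state: null means the next match starts a new entry

  for (var i = 0; i < rows.length; i++) {
    var row = rows[i];
    // Skip rows that carry neither the "oddRow" nor the "evenRow" class.
    if (!/(^|\s)(oddRow|evenRow)(\s|$)/.test(row.className || "")) {
      continue;
    }
    if (firstRow === null) {
      firstRow = row; // first row of an entry: holds the title
    } else {
      // second row of an entry: holds the date information
      entries.push({ titleRow: firstRow, infoRow: row });
      firstRow = null;
    }
  }
  return entries;
}

// Mock rows standing in for real <tr> elements.
var mockRows = [
  { className: "header", id: 0 },
  { className: "oddRow", id: 1 },
  { className: "oddRow", id: 2 },
  { className: "evenRow", id: 3 },
  { className: "evenRow extra", id: 4 },
  { className: "", id: 5 }
];
var pairedEntries = pairEntryRows(mockRows);
```

In the browser the same function could be fed document.getElementsByTagName("tr") directly, with the title and date extraction applied to titleRow and infoRow.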

Once we have extracted the entries, all we have to do is replace the HTML of the original up/down vote box with the HTML of an up/down vote widget. (See the display_votes function in the final script.)

And we are done!

Here is what the reddit page looks like after the Greasemonkey script has run:

[Screenshot: reddit page with the up/down vote widgets inserted by the Greasemonkey script]

Reddit Score Revealer GreaseMonkey Script

Download link: reddit score revealer (greasemonkey script)
Downloaded: 2401 times

Notes: this script works only with the wonderful Firefox browser. To run it you will also need the Greasemonkey extension.

Once you click the link, Firefox will automatically ask you if you want to install this script. Select "Install" and visit reddit.com/new to have infinite power over the regular users!

Here is a screenshot of what the Greasemonkey user script installation dialog looks like:

[Screenshot: Greasemonkey user script installation dialog]

ps. Has anyone successfully debugged Greasemonkey scripts? I could not find a way to set breakpoints or even load the user script in any of the debuggers which come with Firefox. Any suggestions?

pss. The current implementation of the script replaces the original up/down vote box with an iframe. This is kind of ugly. I'll leave it as an exercise for the curious user to change the script to retrieve the score via the XMLHttpRequest interface and change the "published NN minutes/hours ago" status line to one with the score in it. :)

While browsing my favorite programming news site, programming.reddit.com, I stumbled across a link to a video lecture on the upcoming C++ standard, C++0x, by none other than Bjarne Stroustrup himself!

You can start watching it right away or you can download it in DivX, MPEG and other formats.

I have a great interest in the C and C++ family of programming languages and their history, and I have read two of Bjarne's books: The C++ Programming Language and The Design and Evolution of C++. I enjoyed every page of these books. They not only made me a decent C++ programmer but also made me understand how the language was formed, what its goals were, where it was headed and how it got the various constructs it has now. If you ever consider becoming a great C++ programmer, these books are a definite must-read.

The most fundamental things these books taught me were to think at various levels of abstraction and to approach a given programming problem from various programming paradigms.

When I found the link I put aside all the things I was working on and started watching the video lecture! I love C++ that much!

A side note for people wanting to learn C++. I see people argue on programming.reddit.com and other sites that C++ is not worth learning, that it's a dead language, that X is better than C++, etc. Don't listen to this crap! If you ever watched Guy Kawasaki's The Art of the Start video presentation, the 11th point of success is "Don't let the bozos grind you down." That's what they are trying to do if you listen to them! Just start learning C++ and you will succeed with it!

Now, back to the lecture. Here, I cite what the lecturer has to say about his lecture:

A good programming language is far more than a simple collection of features. My ideal is to provide a set of facilities that smoothly work together to support design and programming styles of a generality beyond my imagination. Here, I briefly outline rules of thumb (guidelines, principles) that are being applied in the design of C++0x. Then, I present the state of the standards process (we are aiming for C++09) and give examples of a few of the proposals such as concepts, generalized initialization, being considered in the ISO C++ standards committee. Since there are far more proposals than could be presented in an hour, I'll take questions.

Just like I did while learning JavaScript from video lectures, I am going to blog, with timestamps, about the most interesting things that caught my attention!

Here I list the things that caught my attention in Bjarne's C++ video presentation. The time in brackets is when it appeared in the video. A '+' before the brackets indicates that I knew it already, a '-' that I didn't (just for personal notes). I will write down some obvious facts about the language even though I know them, so that you get an idea of what the lecture was about.

  • +(03:25) C++ is used on both Mars rovers, Spirit and Opportunity. The design life of the rovers was 6 months.
  • +(05:30) C++ is a better C in the sense that it can do roughly the same as C but also has many new features.
  • +/-(06:55) The highest-level goals of C++ are to make it a better language for systems programming and library building, and to make it easier to teach and learn.
  • (08:46) Joke: The next Intels will execute infinite loop in five minutes and that's why you don't need performance. :)
  • +/-(10:00) The main problem for a new revision of the standard is the popularity of C++. Existing and new users want countless improvements. Adding a new feature needs to keep the existing code absolutely stable. Each new feature makes the language harder to learn.
  • +(15:46) The current C++ ISO standard is from 1998, with a revision in 2003.
  • (15:50) Joke: If you can tell the difference between C++ 1998 and C++ 03, then you have just been reading too many manuals.
  • (16:36) Joke: Some of C++ language developers working on C++0x are very, very keen to get that x to be a decimal number.
  • +(17:20) Voting on C++ standard features is done per nation: each nation casts one vote.
  • +/-(19:14) Standardization matters because it directly affects millions of people, because new techniques need to get into mainstream use, and because it's a defense against vendor lock-in.
  • +/-(22:10) Rules of thumb for the standard:
    • Maintain stability and compatibility,
    • Prefer libraries to language extensions,
    • Prefer generality to specialization,
    • Support both experts and novices,
    • Increase type safety,
    • Improve performance and ability to work directly with hardware,
    • Fit into the real world,
    • Make only changes that change the way people think.
  • -(30:28) There are around 100 new proposals for language features.
  • -(33:22) There are far fewer library proposals. Just 11 new proposals for Library TR1!
  • -(35:47) The areas of language change are the machine model and concurrency, modules and DLLs, and support for generic programming.
  • +(38:01) Vector initialization problem example.
  • (40:50) Even Bjarne made an error when using the verbose syntax for initializing a vector with a list of values from an array:

    [Slide: Bjarne's vector initialization mistake (int, double)]

    "If you have tedious, verbose and indirect code, you make mistakes!"
    /Bjarne Stroustrup/

  • -(41:09) Indirect vector initialization from arrays violates Stroustrup's language design principle (from The Design and Evolution of C++) - "Support user-defined and built-in types equally well."
  • -(41:56) The C++0x solution to the initialization problem is initializer lists, std::initializer_list<type>.
  • -(43:43) There are too many ways to initialize things in C++, and each works in different contexts. C++0x introduces a uniform initialization syntax which can be used in any initialization.
  • -(46:50) The fundamental cause of lots of problems with generic programming in C++ is that the compiler doesn't know what template argument types are supposed to do.
  • +/-(49:30) C++98 got templates right in that parameterization didn't require hierarchies, parameterization could be done with non-types, the generated code had uncompromising efficiency, and template instantiation turned out to be Turing complete!
  • -(54:42) The aims of concepts in C++0x are direct expression of intent (the lecture got cut here (they ran out of tape or something) :( and the next moment was somewhere in the future), no performance degradation compared to current code, relatively easy implementation within current compilers, and keeping current template code valid.
  • -(55:33) The lecture continues here from where it was cut. It's something about how the type system ensures correct data types using just declarations, and about compile-time type contracts through templates.
  • -(1:06:14) Quick summary: template aliases, initializer lists, overloading based on concepts, type deduction from initializers, a new for loop for ranges.

After the lecture the following questions were asked:

  • (01:09:40) What's your opinion about the Microsoft implementation of C++?
    • A: Microsoft's implementation is the best out there; they conform to the standard pretty well and the generated code is also good. GNU gcc is also good. However, Microsoft wants you to use their "Managed C++" called C++/CLI, which is totally unportable. Apple does the same with their version of C++, which is Objective-C/C++, and so does GNU. They all play this game of trying to get users to use just their product and not switch to competitors' products.

  • (01:11:56) Do you think you'll ever design a new language from scratch?
    • A: Certainly not from scratch. You have to answer the question, why are you designing a language? You design a language to solve a certain problem. If I ever design a new language it will be because I feel that some problem needs a solution.

  • (01:13:39) You mentioned threads, are there other things like transactions and cache management?
    • A: Concurrency is becoming very important. The question is how do you do it? My solution is to provide language primitives out of which you build libraries that use these primitives and provide various models of concurrency. Doing it directly with language primitives is too hard.

  • (01:16:25) How long after the standard is out do you expect to see a production compiler?
    • A: After the C++0x standard is approved and released the vendors will start releasing compilers right away. Some of them have already built in some of the upcoming features.

  • (01:17:55) Is auto like a type inference?
    • A: auto is kinda type inference but it's very simple. You simply look at the type of initializer and you use it.

  • (01:18:47) Would it be useful to have a switch in every compiler for deprecating features?
    • A: Yes, that would be useful, because compilers have to support old features which we would like to get rid of eventually. I have not been able to convince compiler makers to do it.

  • (01:19:16) Is it possible to do garbage collection (GC) cleanly and efficiently in C++?
    • A: Yes, it is possible to do GC in C++. An implementation already exists and I will have a discussion tomorrow on whether to put it in the standard. There are two problems, though. One is that people would start writing poor code, never caring to free the memory they use, which would lead to poor performance. The other is that GC can be a performance virus.

  • (01:24:39) A lot of academic institutions have dropped teaching C++. As a result there are a lot of poor coding practices and poor coding solutions coming in from people. Are there any plans to have some documentation on how it would be more teachable?
    • A: I have been an academic for the last couple of years and someone talked me into teaching undergrads. I am more used to serious Ph.D.s from good universities with 10 years of experience and it's not quite the same! :) I tried out ideas and I wrote a textbook which will come out some time next year.

  • (01:26:24) How soon after you created C++ did you see it start to take over the industry?
    • A: The first commercial release was in 1985. I had access to data on how many C++ users there were and kept track of it during the 80s. From 1979 till 1991 the doubling rate was 7.5 months. And now we are at 3 million users.

  • (01:28:25) A lot of template classes at the moment use template hoisting to make them more efficient in terms of code size at compilation time. Is there anything being done to address the issues that make it necessary?
    • A: There is a trick of avoiding a lot of separate template instantiations based on void pointer. I don't see any changes to that. That's a portable way of doing it.

  • (01:29:50) What's your opinion on generic programming at runtime level?
    • A: It would be a good idea, but what's mostly called generic programming at runtime level either has so many indirections that it runs at 1/10th of the speed of non-generic code, or it's not very generic and you can't do any of the interesting stuff.

  • (01:31:33) You talked about having user defined types act the same way as built in types. Pointers are used for various optimizations like function overloading and smart pointers. Do you see a problem here? Is it being solved?
    • A: First of all, I think smart pointers are overused. Secondly, we can emulate inheritance with smart pointers. You can basically build a perfect smart pointer now. I worry about smart pointers because if you use a smart pointer and I give you one and we have no agreement on how mine works, we've got a race condition. We have no lock on this code and we've got two pieces of code which poke in the same area. You have to be very careful about the semantics of this smart pointer.

  • (01:33:38) There are interesting parallels between templates and duck typing used in dynamic languages. Will templates overtake classes for writing code and filing contracts?
    • A: Yes, templates do roughly the same as what duck typing does dynamically in scripting languages. I think interfaces will be much better specified with concepts, and there is still a large component of duck typing. Templates are becoming more important. Please remember that templates by themselves are nothing! They help you to abstract over something.

  • (01:37:14) Have you ever gotten any death threats because of the changes in the language?
    • A: I have never gotten any death threats for any reason. And let's keep it that way!

  • (01:37:28) Is there any particular naming convention you subscribe to?
    • A: Yes, I like underscores. I do not like the camel stuff, it's less readable.

  • (01:37:57) About the new language features you come up with: there are so many new languages emerging right now. Do you try to reuse any of the things they have done?
    • A: I try to learn from new languages and particularly from their users. But grafting from one language to another is much harder than most people think. When you see something work in one language, you ask what problem they are solving and whether we can solve it as elegantly in C++. If the answer is no, then we look at how it can be solved and at the way it was done in the other language. But simple grafting is a very hard exercise.

  • (01:38:52) When you initially designed the language, did you start from rigorous specifications or how did it start?
    • A: I am trying to be rigorous, but it's still informal in the sense that it is written in English. I started out with the C specification written by Dennis Ritchie. Some things have improved, some have become more obscure because of the additional words people use. We have tried several times to see if we can also make it formal. It would be nice to have formal semantics either for all of it or for parts of it. It has not been that successful over the years. But I am very happy to report that a group from IBM this year managed to prove that the C++ inheritance system is formally sound. It's proven. So 20 years later they proved that I didn't screw up.

  • (01:40:18) With Sun releasing some hardware which runs Java bytecode, are you afraid that it could take away C++'s embedded position?
    • A: "Java will kill C++ totally in 2 years," Sun said in 1996. They have sort of been repeating this story over and over again. There is a lot of Java, and there is a lot of C++ and it's a big world.

  • (01:41:05) How do you balance things at compile time and runtime, for example exceptions?
    • A: If you need something at runtime, then you have to have it at runtime. For example, most of the good uses of virtual functions can't be done at compile time because you don't have the information. Talking about exceptions, there are compilers which add no overhead if no exceptions are thrown. There are trade-offs and some things you just need to do at runtime.

You can watch the lecture right here as an embedded flash video, or you can download the lecture:

Ever since I published my personal sed, ed and awk cheat sheets in .pdf and .doc formats, I have been receiving suggestions that I should also create plain text versions of them. People said that it was ridiculous to have UNIX tool cheat sheets in .pdf or Microsoft Word (.doc) formats and not to have them in plain text.

I agreed and converted the UNIX tool cheat sheets to plain text format and did some ASCII art formatting so they looked neat.

Enjoy!

(If you also want to download printable .pdf or .doc of these cheat sheets, follow the three links at the beginning of this post!)

UNIX Power Tool Cheat Sheets

AWK Cheat Sheet (.txt):
Download link: awk cheat sheet (.txt)
Downloaded: 109050 times

Sed Cheat Sheet (.txt):
Download link: sed stream editor cheat sheet (.txt)
Downloaded: 38061 times

Ed Cheat Sheet (.txt):
Download link: ed text editor cheat sheet (.txt)
Downloaded: 20104 times

PS. if you notice any bugs, spelling mistakes or just want to thank me, leave a comment :)

During my use of reddit, I have observed that many titles have "(Pic)", "[Picture]", "(Video)", etc. after them. It means that the content the link points to has a picture or video in it. Sometimes I want to have fun with my friends and go through all the pics or vids. Unfortunately, reddit's search is broken and there is really no good way to see the best pics and videos voted on reddit in the past.

I decided to create a reddit media site which will monitor reddit's front page, collect picture & video links, and build an archive of them over time.

The site has been launched:
visit reddit media now

On the about this blog page I wrote about one of the methods I like to use when developing software (and this project requires writing a few tools quickly; more about them below). It is called "the hacker's approach". The hacker's approach is basically writing software as fast as possible, using everything available, without thinking much about best development practices or worrying what others will think of your code. If you are a good programmer, the quality of the code produced is just a bit worse than if you had written it carefully, but the time saved is enormous.

I will release the full source code of the website along with all the programs that generate it. I will also blog about how the tools work and what ideas I used.

Update: Done! The site is up at reddit media: intelligent fun online.

Reddit Media Website's Technical Design Sketch

I use DreamHost shared hosting to run this website. Overall it is a great hosting company and I have been with them for more than a year now! Unfortunately, since it is shared hosting, the server sometimes gets overloaded and serving dynamic pages can become slow (a few seconds to load).

I want the new website to be as fast as possible even when the server is a bit loaded. I do not want any dynamic parsing to be involved when accessing the website. Because of this I will go with generating static HTML pages.

A Perl script will run every 30 minutes from crontab, fetch the reddit.com website, and extract titles and URLs. Another script will add the titles to a lightweight SQLite on-disk database in case I ever want to make the website dynamic. A third script will use the entries in the database to generate the HTML pages.

Technical Design

A knowledgeable user might ask whether this design has a race condition at the moment when a new static page is being generated and a user requests the same page. The answer is no. New pages will be written to temporary files and then moved in place of the existing ones. The website runs on the Linux operating system, and by looking up `man 2 rename' we find that

If newpath already exists it will be atomically replaced (subject to a
few conditions - see ERRORS below), so that there is no point at which
another process attempting to access(2,5) newpath will find it missing.

The rename system call is atomic, which means we have no trouble with race conditions!
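
The write-to-a-temporary-file-then-rename pattern can be illustrated with a few lines of Node.js. This is a sketch of the technique only, not the site's actual Perl generator; the function name writeFileAtomically, the ".tmp" suffix and the demo file names are assumptions made for the example. The temporary file is created next to the target so that both are on the same filesystem, which rename needs in order to replace the file atomically.

```javascript
var fs = require("fs");
var os = require("os");
var path = require("path");

// Hypothetical helper: write `contents` to `targetPath` so that readers see
// either the complete old file or the complete new file, never a partial one.
function writeFileAtomically(targetPath, contents) {
  // Keep the temporary file in the same directory as the target.
  var tmpPath = targetPath + ".tmp." + process.pid;
  fs.writeFileSync(tmpPath, contents);
  // rename(2) replaces the target in a single step.
  fs.renameSync(tmpPath, targetPath);
}

// Demo in a throwaway directory: generate a page, then regenerate it.
var demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "static-site-"));
var page = path.join(demoDir, "index.html");
writeFileAtomically(page, "<html>old</html>");
writeFileAtomically(page, "<html>new</html>");
```

After the second call only index.html remains in the directory, holding the new contents.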

Reddit provides an RSS feed of the front page news. It has the 25 latest news items, of which maybe 5 are media links. That is not enough links to launch the website. People visiting the site will get bored with just 5 links and a few new ones added daily. I need more content right at the moment I launch the site. Or I could launch the site later, when articles have piled up. However, I do not want to wait; I want to launch it ASAP! The hacker's approach!

First, I will create a script which will go through all the pages on reddit looking for picture and video links, and insert the found items in the database. It will match patterns in link titles and will match domains which exclusively contain media.
Here is the list of patterns I could come up with which describe pictures and videos:

  • picture
  • pic
  • image
  • photo
  • comic
  • chart
  • video
  • vid
  • clip
  • film
  • movie

And here are the domains found on reddit which exclusively contain media:

  • youtube.com
  • video.google.com
  • liveleak.com
  • break.com
  • metacafe.com
  • brightcove.com
  • dailymotion.com
  • flicklife.com
  • flurl.com
  • gofish.com
  • ifilm.com
  • livevideo.com
  • video.yahoo.com
  • photobucket.com
  • flickr.com
  • xkcd.com

To write this script I will use LWP::UserAgent to get HTML contents and HTML::TreeBuilder to extract titles and links.

This script will output the found items in human readable format, ready for input to another script which will absorb this information and put it in the SQLite database.

This script is called 'reddit_extractor.pl'. It takes one optional argument, which is the number of reddit pages to extract links from. If no argument is specified, it goes through all reddit pages until it hits the last one. For example, specifying 1 as the first argument makes it parse just the front page. I can now run this script periodically to find links on the front page. No need for parsing RSS.

There is one constant in this script which can be changed. This constant, VOTE_THRESHOLD, sets the threshold of how many votes a post on reddit must have received to be collected by our program. I had to add it because, when digging through older reddit posts, media with 1 or 2 votes can be found, which means it really wasn't that good.

The script outputs each media post matching a pattern or domain in the following format:

title (type, user, reddit id, url)
  • title is the title of the article
  • type is the media type. It can be one of 'video', 'videos', 'picture', 'pictures'. It is plural if the title contains a plural form such as "pics" or "videos".
  • user is the reddit user who posted the link
  • reddit id is the unique identifier reddit uses to identify its links
  • url is the url to the media
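
To make the line format above concrete, here is a small JavaScript sketch that writes and reads such lines. It is an illustration only: the real scripts are written in Perl, the function names and field values are made up, and the parser assumes that the user name and reddit id contain no commas.

```javascript
// Build one output line: title (type, user, reddit id, url)
function formatEntry(entry) {
  return entry.title + " (" +
         [entry.type, entry.user, entry.redditId, entry.url].join(", ") + ")";
}

// The title is matched greedily, so parentheses inside the title are fine;
// the final parenthesised group must start with one of the four media types.
var ENTRY_RE = /^(.*) \((videos?|pictures?), ([^,]+), ([^,]+), (.*)\)$/;

function parseEntry(line) {
  var m = ENTRY_RE.exec(line);
  if (!m) {
    return null; // not a line in the expected format
  }
  return { title: m[1], type: m[2], user: m[3], redditId: m[4], url: m[5] };
}

// Hypothetical entry; every field value here is invented for the example.
var sampleEntry = {
  title: "Cute otters (pics)",
  type: "pictures",
  user: "someuser",
  redditId: "abc12",
  url: "http://example.com/otters"
};
var sampleLine = formatEntry(sampleEntry);
var parsedEntry = parseEntry(sampleLine);
```

Here sampleLine is "Cute otters (pics) (pictures, someuser, abc12, http://example.com/otters)", and parsing it gives back the same five fields.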

Script 'reddit_extractor.pl' can be viewed here:
reddit extractor (perl script, reddit media generator)

Then I will create a script which takes this input and puts it into an SQLite database. It is so trivial that there is not much to write about it.

This script will also be written in the Perl programming language and will use just the DBI and DBD::SQLite modules for accessing the SQLite database.

The script will create an empty database on the first invocation, read the data from stdin and insert the data into the database.

The database design is dead simple. It contains just two tables:

  • reddit which stores the links found on reddit, and
  • reddit_status which contains some info about how the page generator script used the reddit table

Going into more detail, the reddit table contains the following columns:

  • id - the primary key of the table
  • title - title of the media link found on reddit
  • url - url to the media
  • reddit_id - id reddit uses to identify its posts (used by my scripts to link to comments)
  • user - username of the person who posted the link on reddit
  • type - type of the media, can be: 'video', 'videos', 'picture', 'pictures'. It's plural if the title uses the plural form of the media word, such as "pics" or "videos".
  • date_added - the date the entry was added to the database

The other table, reddit_status, contains just two columns:

  • last_id - the last id in the reddit table which the generator script used for generating the site
  • last_run - the date of the last successful run of the generator script
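
The two tables could be sketched roughly like this, using Python's built-in sqlite3 module instead of Perl's DBI. The column types, the UNIQUE constraint on reddit_id (so that re-running the extractor does not insert duplicates) and the helper names are assumptions; only the table and column names come from the description above.

```python
import sqlite3

# Column types and constraints are assumptions; names follow the article.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reddit (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    title      TEXT NOT NULL,
    url        TEXT NOT NULL,
    reddit_id  TEXT NOT NULL UNIQUE,
    user       TEXT,
    type       TEXT,
    date_added TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS reddit_status (
    last_id  INTEGER,
    last_run TEXT
);
"""


def open_database(path):
    """Open the database, creating the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def insert_post(conn, title, url, reddit_id, user, media_type):
    """Insert one post; a post with an already known reddit_id is skipped."""
    conn.execute(
        "INSERT OR IGNORE INTO reddit (title, url, reddit_id, user, type) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, url, reddit_id, user, media_type),
    )
    conn.commit()
```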

This script is called 'db_inserter.pl'. It does not take any arguments but has one constant which has to be changed before use. This constant, DATABASE_PATH, defines the path to the SQLite database. As I mentioned, the database does not have to exist; the script will create one on the first invocation.

Used together, these two scripts can now be run periodically from crontab to monitor reddit's front page and insert the links into the database. It can be done with a command as simple as:

reddit_extractor.pl 1 | db_inserter.pl

Script 'db_inserter.pl' can be viewed here:
db inserter (perl script, reddit media generator)

Now that we have our data, we just need to display it in a nice manner. That's the job of the generator script.

The generator script will be run after the previous two scripts have been run together and it will use information in the database to build static HTML pages.

Since generating static pages is computationally expensive, the generator has to be smart enough to minimize regeneration of already generated pages. I carefully commented the (pretty simple) algorithm that minimizes regeneration; you can take a look at the 'generate_pages' function in the source.

The script generates three kinds of pages at the moment - pages containing all pictures and videos, pages containing just pictures and pages containing just videos.

There is a lot of media featured on reddit and as the script keeps things cached, the directory sizes can grow pretty quickly. If a file system which performs badly with thousands of files in a single directory is used, the runtime of the script can degrade. To avoid this, the generator stores cached reddit posts in subdirectories based on the first character of their file name. For example, if the filename of a cached file is 'foo.bar', then it stores the file as /f/foo.bar.
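
The bucketing idea fits in a few lines. A minimal Python sketch; `cache_root`, `cache_path` and `write_cached` are hypothetical names, and the real generator is written in Perl:

```python
import os


def cache_path(cache_root, filename):
    """Return the path of a cached file, bucketed by its first character."""
    if not filename:
        raise ValueError("filename must not be empty")
    return os.path.join(cache_root, filename[0], filename)


def write_cached(cache_root, filename, data):
    """Write data to its bucketed location, creating the bucket if needed."""
    path = cache_path(cache_root, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(data)
    return path
```

With this layout no single directory holds every cached file; each one holds only the files that share a first character.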

The other thing this script does is locate thumbnail images for media. For example, for YouTube videos, it constructs the URL to their static thumbnails. For Google Video I could not find a public service for easily getting the thumbnail. The only way I found to get a thumbnail of a Google Video is to fetch the contents of the actual video page and extract it from there. The same applies to many other video sites which do not tell developers how to get the thumbnail of a video. Because of this I had to write a Perl module, 'ThumbExtractor.pm', which, given a link to a video or picture, extracts the thumbnail.
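
For the YouTube case, the construction could look like the Python sketch below. It relies on the widely used img.youtube.com/vi/<id>/default.jpg thumbnail address; whether ThumbExtractor.pm builds exactly this URL is an assumption, and `youtube_thumbnail` is a hypothetical name.

```python
from urllib.parse import urlparse, parse_qs


def youtube_thumbnail(url):
    """Build a static thumbnail URL for a YouTube watch link, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host != "youtube.com" and not host.endswith(".youtube.com"):
        return None
    video_ids = parse_qs(parsed.query).get("v")  # the ?v=<id> parameter
    if not video_ids:
        return None
    return f"https://img.youtube.com/vi/{video_ids[0]}/default.jpg"
```

No page download is needed here, which is what sets YouTube apart from sites where the thumbnail has to be scraped from the video page.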

'ThumbExtractor.pm' module can be viewed here:
thumbnail extractor (perl module, reddit media generator)

Some of the links on reddit point directly to the actual image. I wouldn't want the reddit media site to take long to load, which is why I set out to find a way to cache small thumbnails on the server where the website is generated.

I had to write another module, 'ThumbMaker.pm', which downloads the image, makes a thumbnail of it and saves it to a known path accessible from the web server.

'ThumbMaker.pm' module can be viewed here:
thumbnail maker (perl module, reddit media generator)

To manipulate the images (create thumbnails), the ThumbMaker package uses Netpbm open source software.

Netpbm is a toolkit for manipulation of graphic images, including conversion of images between a variety of different formats. There are over 300 separate tools in the package including converters for about 100 graphics formats. Examples of the sort of image manipulation we're talking about are: Shrinking an image by 10%; Cutting the top half off of an image; Making a mirror image; Creating a sequence of images that fade from one image to another.

You will need this software (either compile it yourself, or get the precompiled packages) if you want to run the reddit media website generator scripts!

To use the most common image operations easily, I wrote a package 'Netpbm.pm', which provides operations like resize, cut, add border and others.
'Netpbm.pm' package can be viewed here:
netpbm image manipulation (perl module, reddit media generator)

I hit an interesting problem while developing the ThumbExtractor.pm and ThumbMaker.pm packages - what should they do if the link is to a regular website with just images? There is no simple way to download the right image, the one the website wanted to show to users.
I thought for a moment and came up with an interesting but simple algorithm which finds "the best" image on the site.
It retrieves ALL the images from the site, finds the one with the biggest dimensions and makes a thumbnail out of it. The reasoning is simple: pictures posted on reddit are big and nice, so the biggest picture on the site is most likely the one that was meant to be shown.
A more advanced algorithm would analyze the image's location on the page and add weight to its score depending on where it is located. The closer to the center of the screen, the higher the score.
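
The simple version of the idea can be sketched in Python as follows. It assumes the image dimensions have already been retrieved and reads "biggest dimensions" as the largest pixel area; the `Image` type is hypothetical, and the function only borrows its name from the Perl subroutine mentioned below, not its code.

```python
from typing import NamedTuple, Optional, Sequence


class Image(NamedTuple):
    url: str
    width: int
    height: int


def find_best_image(images: Sequence[Image]) -> Optional[Image]:
    """Pick the image with the largest area, or None if there are none."""
    if not images:
        return None
    return max(images, key=lambda img: img.width * img.height)
```

Comparing by area rather than by width or height alone keeps a wide but very short banner from beating the actual picture.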

For this reason I developed yet another Perl module called 'ImageFinder.pm'. See the 'find_best_image' subroutine to see how it works!

'ImageFinder.pm' module can be viewed here:
best image finder (perl module, reddit media generator)

The generator script also uses CPAN's Template::Toolkit package for generating HTML pages from templates.

The name of the generator script is 'page_gen.pl'. It takes one optional argument, 'regenerate', which, if specified, clears the cache and regenerates all the pages anew. It is useful when templates are updated or changes are made to the thumbnail generator.

Program 'page_gen.pl' can be viewed here:
reddit media page generator (perl script)

While developing any piece of software I like solving various problems on paper. For example, with this site I had to solve the problems of how to regenerate existing pages minimally and how to resize thumbnails so they looked nice.
Here is what the sheet on which I took small notes looked like after the site got published:

reddit media website quick design notes
(sorry for the quality again, I took the picture with a camera phone in two shots and stitched them together with an image editor)

The final website was at redditmedia.com (now moved to http://reddit.picurls.com). Click http://reddit.picurls.com to visit it!

Here are all the scripts packed together with basic documentation:

Download Reddit's Media Site Generator Scripts

All the scripts in a single .zip:
Download link: reddit media website generator suite (.zip)

Individual scripts:

reddit_extractor.pl
Download link: reddit extractor (perl script, reddit media generator)

db_inserter.pl
Download link: db inserter (perl script, reddit media generator)

page_gen.pl
Download link: reddit media page generator (perl script)

ThumbExtractor.pm
Download link: thumbnail extractor (perl module, reddit media generator)

ThumbMaker.pm
Download link: thumbnail maker (perl module, reddit media generator)

ImageFinder.pm
Download link: best image finder (perl module, reddit media generator)

NetPbm.pm
Download link: netpbm image manipulation (perl module, reddit media generator)

For newcomers - What is reddit?

For newcomers, reddit is a social news website where users decide its contents.

From their faq:

What is reddit?

A source for what's new and popular on the web -- personalized for you. We want to democratize the traditional model by giving editorial control to the people who use the site, not those who run it. Your votes train a filter, so let reddit know what you liked and disliked, because you'll begin to be recommended links filtered to your tastes. All of the content on reddit is from users who are rewarded for good submissions (and punished for bad ones) by their peers; you decide what appears on your front page and which submissions rise to fame or fall into obscurity.

Have fun with the website and please tell me what you think about it in the comments! Thanks :)