This is part II of II of the article how picurls.com was created. In part one we made a universal plugin-based website scraper in Perl to get posts from social news and social bookmarking websites.
In this part I will describe the database design and how the user interface was done in PHP and Smarty template and caching engine.
Database design
I chose SQLite database for this project because of its simplicity - the whole database is a single file and its excellent performance and memory footprint. The only thing I am concerned with is concurrency issues.
SQLite FAQ says:
SQLite allows multiple processes to have the database file open at once, and for multiple processes to read the database at once. When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business.
However, client/server database engines (such as PostgreSQL, MySQL, or Oracle) usually support a higher level of concurrency and allow multiple processes to be writing to the same database at the same time. This is possible in a client/server database because there is always a single well-controlled server process available to coordinate access. If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine.
If picurls gets really popular people might have start getting database errors when posting comments or updating their profiles. I certainly do not want to create such a negative user experience.
The database uses the most simple SQL constructs and thus can can easily be moved to a client/server database engine if something bad happens.
Here is how the database scheme of the first version of picurls looks:
CREATE TABLE items (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title STRING NOT NULL,
sane_title STRING NOT NULL,
url STRING NOT NULL,
thumb STRING NOT NULL,
site_id INTEGER NOT NULL,
date_added DATE NOT NULL,
visible BOOL NOT NULL DEFAULT 1
);
CREATE TABLE tmp_items (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title STRING NOT NULL,
url STRING NOT NULL,
date_added DATE NOT NULL,
site_id INTEGER NOT NULL,
tries INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE comments (
id INTEGER PRIMARY KEY AUTOINCREMENT,
comment STRING NOT NULL,
item_id INTEGER NOT NULL,
user_id STRING NOT NULL,
anonymous_name STRING,
ip_address STRING NOT NULL,
date_added DATE NOT NULL
);
CREATE TABLE sites (
id INTEGER PRIMARY KEY,
name STRING NOT NULL UNIQUE,
sane_name STRING NOT NULL UNIQUE,
url STRING NOT NULL UNIQUE,
visible BOOL NOT NULL DEFAULT 1,
priority INTEGER NOT NULL
);
CREATE TABLE users (
id INTEGER PRIMARY KEY AUTOINCREMENT,
username STRING NOT NULL UNIQUE,
password STRING NOT NULL,
data STRING,
ip_address STRING NOT NULL,
date_regged DATE NOT NULL,
date_access DATE NOT NULL,
can_login BOOL NOT NULL DEFAULT 1
);
CREATE INDEX IDX_sites_sane_name on sites(sane_name);
CREATE INDEX IDX_sites_priority on sites(priority);
CREATE INDEX IDX_items_site_id on items(site_id);
CREATE INDEX IDX_items_date_added on items(date_added);
CREATE INDEX IDX_items_sane_title on items(sane_title);
CREATE INDEX IDX_comments_item_id on comments(item_id);
CREATE INDEX IDX_comments_user_id on comments(user_id);
CREATE INDEX IDX_comments_date_added on comments(date_added);
CREATE INDEX IDX_comments_item_user_ip on comments(item_id, user_id, ip_address);
CREATE INDEX IDX_users_username on users(username);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Digg', 'digg', 'http://www.digg.com', 1);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Reddit', 'reddit', 'http://reddit.com', 2);
INSERT INTO sites (name, sane_name, url, priority) VALUES('del.icio.us', 'delicious', 'http://del.icio.us', 3);
INSERT INTO sites (name, sane_name, url, priority) VALUES('StumbleUpon', 'stumbleupon', 'http://www.stumbleupon.com', 4);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Flickr', 'flickr', 'http://www.flickr.com', 5);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Simpy', 'simpy', 'http://www.simpy.com', 6);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Furl', 'furl', 'http://www.furl.net', 7);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Boing Boing', 'boingboing', 'http://www.boingboing.net', 8);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Wired', 'wired', 'http://www.wired.com', 9);
INSERT INTO users (id, username, password, ip_address, date_regged, date_access, can_login) VALUES (0, 'anonymous', 'x', '0.0.0.0', '1970-01-01 00:00:00', '1970-01-01 00:00:00', 0);
As I mentioned in part one I want the whole project to be reusable in the future (I already have an idea what I will fork from this project).
You might have noticed that the database schema almost does not contain fields specific to picurls (except ‘thumb’ field in items and tmp_items tables).
Here is a very brief description of the tables:
- items - contains links to pictures to be displayed on the front page of picurls.
- tmp_itmes - contains links to possible pictures which the scraper (see part one of this article) found on social bookmarking/social news sites.
- omments - contains user comments.
- sites - contains information about sites picurls is collecting pictures from.
- users - contains registered user infromation.
If SQLite becomes unsuitable for picurls at some point, I can just dump the database and use almost the same database schema (maybe changing a few field types) in MySQL or PostgreSQL.
User Interface Design
I chose PHP programming language as the server side language for the user interface of picurls. One of the reasons is that it is one of the most popular programming language and as I am releasing the full source code of picurls, I expect someone to help me with adding features or just spotting bugs
The logic behind handling requests of the user interface works similar to the web.py framework. First we define the URL structure, and specify which scripts will handle which request URLs.
Here is an example of picurl’s URL structure:
$pages = Array(
'#^/(?:index\.(?:html|php))?$#' => 'page-index.php', # main page handler
'#^/site/(\w+)(?:-(\d+))?.html#' => 'page-site.php', # site handler (digg, reddit etc)
'#^/item/([a-z0-9-]+).html#' => 'page-item.php', # single item handler
'#^/login.html#' => 'page-login.php', # login page handler
'#^/register.html#' => 'page-register.php', # registration page handler
'#^/logout.html#' => 'page-logout.php', # logout page handler
'#^/my-comments(?:-(\d+))?.html#'=> 'page-my-comments.php', # my comment page handler
'#^/my-profile.html#' => 'page-my-profile.php' # my profile page handler
);
For example, a request to ‘http://picurls.com/site/digg-3.html‘ would get handled by ‘page-site.php’ script. The value ‘3′ (page number) would get saved, so the page-site.php knew which page number got requested.
Each request to the server gets handled by the default webserver index file - index.php. To have it this way I set up mod_rewrite to rewrite URLs to index.php:
RewriteEngine On RewriteBase / RewriteCond %{REQUEST_FILENAME} !-f RewriteCond %{REQUEST_FILENAME} !-d RewriteRule . /index.php [L]
It is very important to have human readable URLs, such as ‘http://picurls.com/item/animals-really-can-get-along-pictures.html‘ and not something like ‘http://picurls.com/item.php?id=37′. This way the website will rank better on search engines and people will easier find what they were looking for.
I noticed that people coming from search engines to digpicz and redditmedia have most often searched for a set of keywords they remembered from the picture.
Here is the index.php script that handles the URL requests and dispatches them to appropriate scripts:
See picurls.com url structure handler script (index.php) (downloaded: 426 times)
And here is the page-site.php script which handles requests for pictures from particular site (such as digg or stumbleupon):
See picurls.com site handler script (page-site.php) (downloaded: 378 times)
The actual contents of the pages get served by Smarty templating and caching framework. It is a good practice to separate application logic and content from its presentation. Read the about Smarty page for an example why it’s a good practice if you have not done it before.
As I went with a dynamic (PHP) solution for serving contents and I expect the website to become quite popular and I am on a server with limited resources, I needed to find a good and fast way to display contents of the website. Smarty has exactly what I am looking for - support for caching.
The first version of picurls.com does not do caching dynamically (based on content change), instead it just caches pages for some constant time and then flushes cache until the next request.
That’s about it.
Please ask in the comments if you want a more detailed explanation of some software components!
Also, a forums have been setup for all my projects! Visit catonmat forums or picurls forums!
All the scripts in a single .zip:
Download link: picurls.com full source code
Downloaded: 468 times
The source code is released under GNU General Public License.
If you use the source and create your own picurls-like website you must link back to my (this) blog http://www.catonmat.net and you must link back to http://picurls.com!
The .zip archive contains several subdirectories:
- cache - directory where Smarty keeps cached pages.
- db - directory where the SQLite database is kept. I included a sample database with 90 pictures - 10 pictures from each of the sites picurls.com takes contents from (digg, reddit, delicious, flickr, stumbleupon, simpy, furl, wired and boing boing).
- locks - directory where scripts hold their lockfiles to ensure single copy of scraper/thumbnail generator scripts are running at any given time
- scraper - website data-miner/scraper program (see part one of this article for more information).
- scripts - scripts which call scraper, insert data in the database and generate thumbnails.
- templates - HTML templates for Smarty.
- templates_c - directory where Smarty keeps compiled templates.
- www - main website directory, containing all the PHP scripts, CSS, images and thumbnails (around 90 thumbnails for each item in the sample database).
To get your own picurls website running, you will have to configure config.php file in www directory where you need to specify full path to the sqlite database (in db directory). That’s just the user interface, though. No new items will ever be retrieved from websites because you have to submit the scraper up. The instructions would take another article. If you want to try, though, look at cronjob.sh shell script in scripts directory. Running this script periodically will scrape websites for new posts which look like images posts and try to insert them in the database (you will have to change some constants in picurls_db_inserter.pl script). After this script has run, pic_mover.pl script has to be run (also needs constants changed).
Hosting
ZigZap Technologies have kindly provided me with a server hosting for all my needs. If you are looking for hosting solutions, they are excellent!
Enjoy!
Did you like this post? Subscribe to my posts!

(9 votes, average: 4.11 out of 5)
|
|
|


October 15th, 2007 at 2:03 am
Great script, brilliant concept and giving us all the source code is very cool way of riding the new web 2.0.
October 15th, 2007 at 2:43 am
You’re welcome, BHDomainer
Home someone makes some changes to my scripts and makes something more cool out of it 
October 16th, 2007 at 2:45 pm
[…] about linux news. Or one about programming. I leave you to think for what else. Here and here you can find out how that idea was put into practice and the story behind the […]
October 16th, 2007 at 10:46 pm
[…] In loc sa cautati pe digg, reddit si alte site-ul de “social networking” poze mai bine vizitati www.picurls.com - site generat automat care cauta pe mai multe astfel de site-uri care strang articole de pe net. Este un proiect interesant - chiar ma gandeam la o asemenea idee - sa strangi de pe site-urile de tipul “social networking” articolele dintr-o singura categorie. Deocamdata site-ul asta stranage poze, dar se preconizeaza sa se mearga pe ideea asta mai mult. Cel care a programat picurls.com a facut un bine omenirii si a publicat sursele proiectului si metoda prin care a facut site-ul. Sursele si povestea din spatele proiectului pot fi citite in doua posturi de pe blogul autorului proiectului aici si aici. […]
October 18th, 2007 at 2:58 am
Thank you for sharing the source code.
October 18th, 2007 at 4:45 pm
[…] criou dois artigos em que explica a metodologia adoptada no desenvolvimento do PicURLs (Parte I ; Parte II) e, também como de costume, coloca para download - fundo da página - o código fonte do projecto. […]
October 19th, 2007 at 5:21 pm
[…] In loc sa cautati pe digg, reddit si alte site-ul de “social networking” poze mai bine vizitati www.picurls.com - site generat automat care cauta pe mai multe astfel de site-uri care strang articole de pe net. Este un proiect interesant - chiar ma gandeam la o asemenea idee - sa strangi de pe site-urile de tipul “social networking” articolele dintr-o singura categorie. Deocamdata site-ul asta stranage poze, dar se preconizeaza sa se mearga pe ideea asta mai mult. Cel care a programat picurls.com a facut un bine omenirii si a publicat sursele proiectului si metoda prin care a facut site-ul. Sursele si povestea din spatele proiectului pot fi citite in doua posturi de pe blogul autorului proiectului aici si aici. […]
October 22nd, 2007 at 3:18 pm
[…] full story here […]
October 27th, 2007 at 2:26 pm
[…] read more | digg story […]
November 2nd, 2007 at 6:54 pm
[…] din spatele proiectului pot fi citite in doua posturi de pe blogul autorului proiectului aici si aici. Elemente: Creativitate, Programare, […]
November 4th, 2007 at 2:24 pm
[…] read more | digg story […]
November 10th, 2007 at 6:17 pm
And here’s the facebook application based on your code..
http://apps.facebook.com/buzzwire/
Enjoy
November 10th, 2007 at 11:58 pm
well seems you got a db problem on it at the moment.. digg-effect?
November 11th, 2007 at 8:45 am
you should add http://ffffound.com to your list of sites on picurls
January 3rd, 2008 at 11:46 pm
thanks , excellent post
April 19th, 2008 at 6:17 pm
Hello! Good site, much useful
July 9th, 2008 at 4:23 pm
tests time mashine