I was doing a WordPress installation the other day when I noticed how insecure the default generated password was.

On line 38 in wp-admin/includes/upgrade.php (wordpress version 2.3.1) I found that a 6 character password is generated this way:

$random_password = substr(md5(uniqid(microtime())), 0, 6);

The md5 function returns a 32 character hexadecimal number and substr chops off first six characters. Doing elementary combinatorics we can find that the number of possible passwords is 166 (16 to the power 6) or 16,777,216, or roughly just 16.7 million passwords!

I am more than sure that most people doing WP installations never change the default password. If you're on a good connection and can do just 100 password checks per second, then you can crack a WordPress installation in worst case time of 16,777,216/100 seconds, which is 46.6 hours! Most likely you'd crack the password in half of that time, so you can crack any WordPress installation that has a default password in about 24 hours!

charles darvin - natural selectionI have started working on my bachelor's thesis in physics. It's about using genetic algorithms for finding optimal solutions to physics problems. One of the problems I will solving is simulating equilibrium configurations of two dimensional systems of particles which can attract and repel (dipole systems). This problem is NP hard, which means there is no effective algorithm to find the exact solution in an adequate time. Several other algorithms can be used to approximate the solution and find near-equilibrium configurations, one of them being genetic algorithm technique.

The easiest situation is when the particles are bounded in a circle which border they can't trespass. The goal is to find how these particles will position themselves inside this circle.

This case has already been tackled using the simulated annealing method and the near-optimum solutions for hundreds of particles have been found. This method uses different principles than genetic algorithm method which I will describe shortly. The main idea of simulated annealing method is that the system is given an initial temperature which basically controls how much the particles will fluctuate inside the system. The greater the temperature, the bigger the fluctuations. The temperature gradually gets decreased and the fluctuations get smaller and smaller. Calculating the energy of the system and requiring it to be minimal at each step the temperature gets decreased eventually leads to a solution.

Here is how the solutions for 10, 11, 12, 13, 14 and 15 particles looks like, found using simulated annealing algorithms:

dipole system

Notice how interestingly and non-intuitively the particles position themselves in a case when another particle gets added to a system of 12 particles. Instead of positioning the new particle somewhere in the middle as in transition from 11 particles to 12, the system decides to put it on the border.

In my thesis I will be using genetic algorithms to arrive at the solution for this problem.

Here is a general (not related to my topic of thesis) description of what genetic algorithms are.

Genetic Algorithms 101

Genetic algorithms mimic the evolution by natural selection. The basic idea of genetic algorithms is very simple. Genetic algorithms feature populations of individuals which evolve with the use of the principles of selection, variation and inheritance.

One of the ways to implement this idea in computer programs is to represent individuals as strings of binary digits. Each bit in the string represents one gene. Each individual is assigned a numerical evaluation of its merit by a fitness function. The fitness function determines how each gene of an individual will be interpreted and, thus, what specific problem the population will evolve to solve.

Once all individuals in the population have been evaluated, their fitness values are used for selection. Individuals with low fitness get eliminated and the strongest get selected. Inheritance is implemented by making multiple copies of high-fitness individuals. The high-fitness individuals get mutated and they crossover to produce a new population of individuals. Mutation is implemented as flipping individual bits in the binary string representation of an individual and crossover happens as an exchange of binary substrings of two individuals to obtain a new offspring.

By transforming the previous set of individuals to a new one, the algorithm generates a new set of individuals that have better fitness than the previous set of individuals. When the transformations are applied over and over again, the individuals in the population tend to represent improved solutions to whatever problem was posed in the fitness function.

Here is an illustration how a genetic algorithm might work:

operation of the genetic algorithm programming

I look forward to finding what other interesting projects that I can make using the very useful information I have learnt.

PS. I am writing blog posts less often now because I am really, really busy with studies. Sorry about that.

bash readline emacs editing mode default keyboard shortcut cheat sheetWhen you are working in a shell you certainly don't want to waste your time using arrow keys or home/end keys to navigate around the command line. One of the most popular shells, bash - Bourne Again SHell, uses GNU's Readline library for reading the command line.

The GNU Readline library provides a set of functions for use by applications that allow users to edit command lines as they are typed in. The readline library also includes functions to maintain a list of previously-entered command lines, to recall and perhaps reedit those lines, and perform csh-like history expansion on previous commands. Both emacs and vi editing modes are available.

I have mastered both of the editing modes and have created cheat sheets for both of them (and a tiny separate one for readline's history expansion).

This is a cheat sheet for the default, emacs, editing mode.

Here are a few examples with screenshots on how to use this editing mode.

Let '[]' be the position of cursor in all the examples.

Example 1: movement basics

Suppose you are at the end of the line and want to move 3 words backwards.

$ echo word1 word2 word3 word4 word5 word6[]

If you hit M-3 followed by M-b, you would end up exactly where you wanted:

$ echo word1 word2 word3 []word4 word5 word6

An alternative is to hit M-b three times in a row: M-b M-b M-b

If you look up on the cheat sheet what M-3 does, it sets the numeric-argument to 3 which in this case acts as a counter how many times should M-b command be repeated. The M-b command calls backward-word function which does the obvious.

The numeric-argument can also be negative, which makes the argument to be applied in the opposite direction.

Other shortcuts of interest are M-f to move forward and C-a, and C-e to move to the beginning and end of line.

Example 2: command history

Suppose you used a pretty complex command a while ago and now you remember just a few arguments of this command. You want to find this command and call it with a few arguments modified.

If you hit C-r readline will put you in an incremental reverse history search mode. Typing a part of the arguments you remember, will locate the previously executed command matching the typed text. Hitting C-r again will locate any other command which matches your typed text.

To put the found command on command line for editing hit C-j.

Example 3: completing

Suppose you want to quickly list all the users on the system.

Hit C-x ~ and read-line will attempt username completion and output all the usernames to the terminal.

$ []
adm        catonmat   ftp        halt       mailnull   nobody     root       smmsp      vcsa
apache     cpanel     games      lp         mysql      nscd       rpc        sshd
bin        daemon     gopher     mail       named      operator   rpm        sync
cat        dbus       haldaemon  mailman    news       picurls    shutdown   uucp
$ []

Suppose you now want to quickly list all the users on the system starting with 'm'. You can type 'm' followed by the same C-x ~ to do that.

$ m[]
mail      mailman   mailnull  mysql
$ m[]

The other interesting completions are:

  • C-x / which lists possible filename completion,
  • C-x $ which lists possible bash variable completion,
  • C-x @ which lists possible hostname completion and,
  • C-x ! which lists possible command completion.


  • Meta-/ which does filename completion,
  • Meta-$ which does bash variable completion,
  • Meta-@ which does hostname completion and,
  • Meta-! which does command completion.

Example 3: killing and yanking

Suppose you have to type a-long-word-like-this a couple of times.

The easiest way to do this is to kill the word, which puts it into the kill ring. Contents of the kill ring can be accessed by yanking.

For example, type 'a-long-word-like-this' in the shell:

$ command a-long-word-like-this []

Now press C-w to kill one word backward:

$ command []

Press C-y to yank (paste) the word as many times as you wish (I pressed it 3 times here:)

$ command a-long-word-like-this a-long-word-like-this a-long-word-like-this []

The kill ring does not contain just the one latest killing. It can be filled with a number of kills and rotated with M-y shortcut.

Another example:

Suppose you typed a longer command and you noticed that part of the THE TEXT GOT TYPED IN CAPITAL LETTERS. Without knowing the readline shortcuts you would erase the text and probably type it again. Now you can use the readline keyboard shortcuts and change the case very, very quickly.

You can use the following shortcuts to accomplish this:

1) M-l (Meta-l (on your computer, probably ESC-l)) shortcut is bound to readline's downcase-word function which lowercases the current word.
2) M-b shortcut is bound to readline's backward-word function which moves the cursor one word backwards.
3) M-<number> shortcut is bound to readline's numeric-argument function which in some cases acts as how many times should the following command be repeated.

Here is a real word example, suppose we have typed the following ([] is the cursor):


To get to the beginning of 'THE' we might repetitively hit M-b seven times or we could set the numeric argument to seven by typing M-7 and then hit M-b once.

After doing this the cursor would have moved before the word 'THE':


Now, by setting the numerical argument to 7 again and by pressing M-l or by pressing M-l seven times, we turn the text all in lower case.

$ echo the text. the text got typed in capital letters[]

Actually what we did in this example was not as efficient as it could have been. The numeric-argument shortcut accepts negative arguments which turn the direction of the following command in other direction. We could have turned the text in lower case by hitting M--7 and M-l

If you really want to be more productive, I suggest you play around with the commands in the cheat sheet for a while.
My previous article on being more productive on the command line was screen's cheat sheet which allows to emulate multiple terminals in a single window. You can take a look at it as well!

Download Emacs Editing Mode Cheat Sheet

PDF format (.pdf):
Download link: readline emacs cheat sheet (.pdf)
Downloaded: 73825 times

ASCII .txt format:
Download link: readline emacs cheat sheet (.txt)
Downloaded: 17203 times

LaTeX format (.tex):
Download link: readline emacs cheat sheet (.tex)
Downloaded: 9629 times

This cheat sheet is released under GNU Free Document License.

The next cheat sheet will be readline's vi editing mode's default keyboard shortcut cheat sheet!

extract mp3 audio track from youtube videoA few days ago my blog reader, Ankush Agarwal, on the comments of downloading youtube videos with gawk article asked:

I've seen tools available to download just the audio from a youtube video, in various formats; but as per your explanation it seems, that the audio is integrated with the video in the .swf file. How can we extract only the audio part and have it converted to a format like mp3?

As I have written a few articles before on how to download YouTube videos with Perl, gawk and VBScript, and how to convert the downloaded flash video files (flv) to divx or xvid, or any other format with ffmpeg, it was very easy to help this guy.

This is a guide that explains how to extract audio tracks from any videos, not just YouTube.

First, lets download the ffmpeg tool (that's for Windows Operating System. If you are using linux operating system, you can get the ffmpeg tool as a package distribution) and open the ffmpeg documentation in another window.

Lets choose a sample video which we will extract the audio track from. I found some music video clip "My Chemical Romance - Famous Last Words" (http://www.youtube.com/watch?v=8bbTtPL1jRs).

Now, lets download the music video. If you are on a windows machine, you may use my VBScript program to download the video (download vbscript youtube video downloader, read how to use it here), or if you are on linux, you may use gawk program to download the video (download gawk youtube video downloader, read how to use it here).

After downloading the video, I ended up with a file named My_Chemical_Romance_-_Famous_Last_Words.flv.

Once you have downloaded the video, just for the sake of interest, lets find out the audio quality of this You Tube audio video.
The ffmpeg documentation does not tell us about a switch which would just output the audio parameters of the input file. After experimenting a little with the ffmpeg tool, it can be found that by just specifying '-i' switch and the input video file, the ffmpeg will output input streams information and quit.

Here is an example of how it looks:

c:\> ffmpeg.exe -i My_Chemical_Romance_-_Famous_Last_Words.flv

Seems that stream 1 comes from film source: 1000.00 (1000/1) -> 24.00 (24/1)
Input #0, flv, from 'My_Chemical_Romance_-_Famous_Last_Words.flv':
  Duration: 00:04:27.4, start: 0.000000, bitrate: 64 kb/s
  Stream #0.0: Audio: mp3, 22050 Hz, mono, 64 kb/s
  Stream #0.1: Video: flv, yuv420p, 320x240, 24.00 fps(r)
Must supply at least one output file

From this information (2nd line in bold) we can read that the audio bitrate of a YouTube video is 64kbit/s, sampling rate is 22050Hz, the encoding is mp3, and it's a mono audio.

You will be surprised how easy it is to extract the audio part as it is in the video. By just typing:

c:\> ffmpeg.exe -i My_Chemical_Romance_-_Famous_Last_Words.flv famous_last_word.mp3

the ffmpeg tool will extract it to an mp3 audio file!

That's it! After running this command you should have 'famous_last_words.mp3' file in the same folder/directory where the downloaded video file was!

We can go a little further and look up various audio switches on the documentation of ffmpeg. For example, if we had some fancy alarm clock which can be stuffed an mp3, you might not need the whole 64kbit/s of bitrate. You might want to convert the audio to a lower bitrate, say 32kbit/s.

The Section 3.5 - Audio Options of the ffmpeg documentation says:

`-ab bitrate' - Set the audio bitrate in bit/s (default = 64k).

So, by specifying a command line switch '-ab 32k' the audio will be converted to a lower bitrate of 32kbit/s.

Here is the example of running this command:

c:\> ffmpeg.exe -i My_Chemical_Romance_-_Famous_Last_Words.flv -ab 32k famous_last_word.32kbit.mp3
Seems that stream 1 comes from film source: 1000.00 (1000/1) -> 24.00 (24/1)
Input #0, flv, from 'My_Chemical_Romance_-_Famous_Last_Words.flv':
  Duration: 00:04:27.4, start: 0.000000, bitrate: 64 kb/s
  Stream #0.0: Audio: mp3, 22050 Hz, mono, 64 kb/s
  Stream #0.1: Video: flv, yuv420p, 320x240, 24.00 fps(r)
Output #0, mp3, to 'famous_last_word.32kbit.mp3':
  Stream #0.0: Audio: mp3, 22050 Hz, mono, 32 kb/s
Stream mapping:
  Stream #0.0 -> #0.0
size=    1045kB time=267.6 bitrate=  32.0kbits/s
video:0kB audio:1045kB global headers:0kB muxing overhead 0.000000%

The line in bold indicates that the output audio indeed was at a bitrate of 32kbit/s.

Some other things you can do are - changing the codec of the audio (-acodec option (find all codecs with -formats option)) or cut out a part of the audio (-t and -ss options) you are interested in.

This technique actually involved re-encoding the audio which was already in the movie file. If you read closely the audio option documentation, you will find that the -acodec option says:

`-acodec codec' - Force audio codec to codec. Use the copy special value to specify that the raw codec data must be copied as is.

If the input video file was from YouTube or it already had mp3 audio stream, then using the following command line, the audio will be extracted much, much faster:

c:\> ffmpeg.exe -i My_Chemical_Romance_-_Famous_Last_Words.flv -acodec copy famous_last_words.mp3

Have fun ripping your favorite music off YouTube!

ps. Do you have something cool and useful you would like to accompish but do not have the necessary computer skills? Let me know in the comments and I will see if I can write an article about it!

This article is part of the article series "Creating picurls.com."
<- previous article next article ->

The making of picurls.com, the Popurls for Pictures: buzziest picsThis is part II of II of the article how picurls.com was created. In part one we made a universal plugin-based website scraper in Perl to get posts from social news and social bookmarking websites.

In this part I will describe the database design and how the user interface was done in PHP and Smarty template and caching engine.

Picurls.com launches:
picurls - picture buzz, buzziest pics on the net

Database design

I chose SQLite database for this project because of its simplicity - the whole database is a single file and its excellent performance and memory footprint. The only thing I am concerned with is concurrency issues.

SQLite FAQ says:

SQLite allows multiple processes to have the database file open at once, and for multiple processes to read the database at once. When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business.

However, client/server database engines (such as PostgreSQL, MySQL, or Oracle) usually support a higher level of concurrency and allow multiple processes to be writing to the same database at the same time. This is possible in a client/server database because there is always a single well-controlled server process available to coordinate access. If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine.

If picurls gets really popular people might have start getting database errors when posting comments or updating their profiles. I certainly do not want to create such a negative user experience.

The database uses the most simple SQL constructs and thus can can easily be moved to a client/server database engine if something bad happens.

Here is how the database scheme of the first version of picurls looks:

  title      STRING  NOT NULL,
  sane_title STRING  NOT NULL,
  url        STRING  NOT NULL,
  thumb      STRING  NOT NULL,
  site_id    INTEGER NOT NULL,
  date_added DATE    NOT NULL,
  visible    BOOL    NOT NULL DEFAULT 1

CREATE TABLE tmp_items (
  title      STRING  NOT NULL,
  url        STRING  NOT NULL,
  date_added DATE    NOT NULL,
  site_id    INTEGER NOT NULL,

CREATE TABLE comments (
  comment        STRING  NOT NULL,
  item_id        INTEGER NOT NULL,
  user_id        STRING  NOT NULL,
  anonymous_name STRING,
  ip_address     STRING  NOT NULL,
  date_added     DATE    NOT NULL

  visible   BOOL    NOT NULL DEFAULT 1,
  priority  INTEGER NOT NULL

  password    STRING NOT NULL,
  data        STRING,
  ip_address  STRING NOT NULL,
  date_regged DATE   NOT NULL,
  date_access DATE   NOT NULL,
  can_login   BOOL   NOT NULL DEFAULT 1

CREATE INDEX IDX_sites_sane_name       on sites(sane_name);
CREATE INDEX IDX_sites_priority        on sites(priority);
CREATE INDEX IDX_items_site_id         on items(site_id);
CREATE INDEX IDX_items_date_added      on items(date_added);
CREATE INDEX IDX_items_sane_title      on items(sane_title);
CREATE INDEX IDX_comments_item_id      on comments(item_id);
CREATE INDEX IDX_comments_user_id      on comments(user_id);
CREATE INDEX IDX_comments_date_added   on comments(date_added);
CREATE INDEX IDX_comments_item_user_ip on comments(item_id, user_id, ip_address);
CREATE INDEX IDX_users_username        on users(username);

INSERT INTO sites (name, sane_name, url, priority) VALUES('Digg',        'digg',        'http://www.digg.com',        1);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Reddit',      'reddit',      'http://reddit.com',          2);
INSERT INTO sites (name, sane_name, url, priority) VALUES('del.icio.us', 'delicious',   'http://del.icio.us',         3);
INSERT INTO sites (name, sane_name, url, priority) VALUES('StumbleUpon', 'stumbleupon', 'http://www.stumbleupon.com', 4);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Flickr',      'flickr',      'http://www.flickr.com',      5);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Simpy',       'simpy',       'http://www.simpy.com',       6);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Furl',        'furl',        'http://www.furl.net',        7);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Boing Boing', 'boingboing',  'http://www.boingboing.net',  8);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Wired',       'wired',       'http://www.wired.com',       9);

INSERT INTO users (id, username, password, ip_address, date_regged, date_access, can_login) VALUES (0, 'anonymous', 'x', '', '1970-01-01 00:00:00', '1970-01-01 00:00:00', 0);

As I mentioned in part one I want the whole project to be reusable in the future (I already have an idea what I will fork from this project).

You might have noticed that the database schema almost does not contain fields specific to picurls (except 'thumb' field in items and tmp_items tables).

Here is a very brief description of the tables:

  • items - contains links to pictures to be displayed on the front page of picurls.
  • tmp_itmes - contains links to possible pictures which the scraper (see part one of this article) found on social bookmarking/social news sites.
  • omments - contains user comments.
  • sites - contains information about sites picurls is collecting pictures from.
  • users - contains registered user infromation.

If SQLite becomes unsuitable for picurls at some point, I can just dump the database and use almost the same database schema (maybe changing a few field types) in MySQL or PostgreSQL.

User Interface Design

I chose PHP programming language as the server side language for the user interface of picurls. One of the reasons is that it is one of the most popular programming language and as I am releasing the full source code of picurls, I expect someone to help me with adding features or just spotting bugs :)

The logic behind handling requests of the user interface works similar to the web.py framework. First we define the URL structure, and specify which scripts will handle which request URLs.

Here is an example of picurl's URL structure:

$pages = Array(
    '#^/(?:index\.(?:html|php))?$#'  => 'page-index.php',       # main page handler
    '#^/site/(\w+)(?:-(\d+))?.html#' => 'page-site.php',        # site handler (digg, reddit etc)
    '#^/item/([a-z0-9-]+).html#'     => 'page-item.php',        # single item handler
    '#^/login.html#'                 => 'page-login.php',       # login page handler
    '#^/register.html#'              => 'page-register.php',    # registration page handler
    '#^/logout.html#'                => 'page-logout.php',      # logout page handler
    '#^/my-comments(?:-(\d+))?.html#'=> 'page-my-comments.php', # my comment page handler
    '#^/my-profile.html#'            => 'page-my-profile.php'   # my profile page handler

For example, a request to 'http://picurls.com/site/digg-3.html' would get handled by 'page-site.php' script. The value '3' (page number) would get saved, so the page-site.php knew which page number got requested.

Each request to the server gets handled by the default webserver index file - index.php. To have it this way I set up mod_rewrite to rewrite URLs to index.php:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

It is very important to have human readable URLs, such as 'http://picurls.com/item/animals-really-can-get-along-pictures.html' and not something like 'http://picurls.com/item.php?id=37'. This way the website will rank better on search engines and people will easier find what they were looking for.

I noticed that people coming from search engines to digpicz and redditmedia have most often searched for a set of keywords they remembered from the picture.

Here is the index.php script that handles the URL requests and dispatches them to appropriate scripts:
See picurls.com url structure handler script (index.php) (downloaded: 9233 times)

And here is the page-site.php script which handles requests for pictures from particular site (such as digg or stumbleupon):
See picurls.com site handler script (page-site.php) (downloaded: 7477 times)

The actual contents of the pages get served by Smarty templating and caching framework. It is a good practice to separate application logic and content from its presentation. Read the about Smarty page for an example why it's a good practice if you have not done it before.

As I went with a dynamic (PHP) solution for serving contents and I expect the website to become quite popular and I am on a server with limited resources, I needed to find a good and fast way to display contents of the website. Smarty has exactly what I am looking for - support for caching.

The first version of picurls.com does not do caching dynamically (based on content change), instead it just caches pages for some constant time and then flushes cache until the next request.

That's about it. :)

Please ask in the comments if you want a more detailed explanation of some software components!

Download Picurls Website Source Code

All the scripts in a single .zip:
Download link: picurls.com full source code
Downloaded: 1219 times

The source code is released under GNU General Public License.

If you use the source and create your own picurls-like website you must link back to my (this) blog http://www.catonmat.net and you must link back to http://picurls.com!

The .zip archive contains several subdirectories:

  • cache - directory where Smarty keeps cached pages.
  • db - directory where the SQLite database is kept. I included a sample database with 90 pictures - 10 pictures from each of the sites picurls.com takes contents from (digg, reddit, delicious, flickr, stumbleupon, simpy, furl, wired and boing boing).
  • locks - directory where scripts hold their lockfiles to ensure single copy of scraper/thumbnail generator scripts are running at any given time
  • scraper - website data-miner/scraper program (see part one of this article for more information).
  • scripts - scripts which call scraper, insert data in the database and generate thumbnails.
  • templates - HTML templates for Smarty.
  • templates_c - directory where Smarty keeps compiled templates.
  • www - main website directory, containing all the PHP scripts, CSS, images and thumbnails (around 90 thumbnails for each item in the sample database).

To get your own picurls website running, you will have to configure config.php file in www directory where you need to specify full path to the sqlite database (in db directory). That's just the user interface, though. No new items will ever be retrieved from websites because you have to submit the scraper up. The instructions would take another article. If you want to try, though, look at cronjob.sh shell script in scripts directory. Running this script periodically will scrape websites for new posts which look like images posts and try to insert them in the database (you will have to change some constants in picurls_db_inserter.pl script). After this script has run, pic_mover.pl script has to be run (also needs constants changed).