Computer Science July 07, 2009

# On the Linear Time Algorithm For Finding Fibonacci Numbers

In this article I'd like to show how the theory does not always match the practice. I am sure you all know the linear time algorithm for finding Fibonacci numbers. The analysis says that the running time of this algorithm is O(n). But is it still O(n) if we actually run it? If not, what is wrong?

Let's start with the simplest linear time implementation of the Fibonacci number generating algorithm in Python:

```def LinearFibonacci(n):
fn = f1 = f2 = 1
for x in xrange(2, n):
fn = f1 + f2
f2, f1 = f1, fn
return fn
```

The theory says that this algorithm should run in O(n) - given the n-th Fibonacci number to find, the algorithm does a single loop up to n.

Now let's verify if this algorithm is really linear in practice. If it's linear then the plot of n vs. running time of LinearFibonacci(n) should be a line. I plotted these values for n up to 200,000 and here is the plot that I got:

Note: Each data point was averaged over 10 calculcations.

Oh no! This does not look linear at all! It looks quadratic! I fitted the data with a quadratic function and it fit nearly perfectly. Do you know why the seemingly linear algorithm went quadratic?

The answer is that the theoretical analysis assumed that all the operations in the algorithm executed in constant time. But this is not the case when we run the algorithm on a real machine! As the Fibonacci numbers get larger, each addition operation for calculating the next Fibonacci number "fn = f1 + f2 " runs in time proportional to the length of the previous Fibonacci number. It's because these huge numbers no longer fit in the basic units of computation in the CPU; so a big integer library is required. The addition of two numbers of length O(n) in a big integer library takes time of O(n).

I'll show you that the running time of the real-life linear Fibonacci algorithm really is O(n^2) by taking into account this hidden cost of bigint library.

So at each iteration i we have a hidden cost of O(number of digits of fi) = O(digits(fi)). Let's sum these hidden cost for the whole loop up to n:

Now let's find the number of digits in the n-th Fibonacci number. To do that let's use the well-known Binet's formula, which tells us that the n-th Fibonacci number fn can be expressed as:

It is also well-known that the number of digits in a number is integer part of log10(number) + 1. Thus the number of digits in the n-th Fibonacci number is:

Thus if we now sum all the hidden costs for finding the n-th Fibonacci number we get:

There we have it. The running time of this "linear" algorithm is actually quadratic if we take into consideration that each addition operation runs proportionally to the length of addends.

Next time I'll show you that if the addition operation runs in constant time, then the algorithm is truly linear; and later I will do a similar analysis of the logarithmic time algorithm for finding Fibonnaci numbers that uses this awesome matrix identity:

Don't forget to subscribe if you are interested! It's well worth every byte!

Programming June 30, 2009

# Low Level Bit Hacks You Absolutely Must Know

I decided to write an article about a thing that is second nature to embedded systems programmers - low level bit hacks. Bit hacks are ingenious little programming tricks that manipulate integers in a smart and efficient manner. Instead of performing some operation (such as counting the 1 bits in an integer) by looping over individual bits, these programming nuggets do the same with one or two carefully chosen bitwise operations.

To get things going I'll assume that you know what the two's complement binary representation of an integer is and also that you know all the the bitwise operations.

I'll use the following notation for bitwise operations in the article:

```&   -  bitwise and
|   -  bitwise or
^   -  bitwise xor
~   -  bitwise not
<<  -  bitwise shift left
>>  -  bitwise shift right
```

The numbers in the article are 8 bit signed integers (though the operations work on arbitrary length signed integers) that are represented as two's complement and they are usually named 'x'. The result is usually 'y'. The individual bits of 'x' are named b7, b6, b5, b4, b3, b3, b2, b1 and b0. The bit b7 is the sign bit (the most significant bit), and b0 is the least significant.

I'll start with the most basic bit hacks and gradually progress to more difficult ones. I'll use examples to explain how each bithack works.

If you are intrigued by this topic I urge you to subscribe to my blog. I can share a secret that there will be the 2nd part of this article where I cover more advanced bit hacks, and I will also release a cheat sheet with all these bit tricks! It's well worth subscribing!

Here we go.

Bit Hack #1. Check if the integer is even or odd.

```if ((x & 1) == 0) {
x is even
}
else {
x is odd
}
```

I am pretty sure everyone has seen this trick. The idea here is that an integer is odd if and only if the least significant bit b0 is 1. It follows from the binary representation of 'x', where bit b0 contributes to either 1 or 0. By AND-ing 'x' with 1 we eliminate all the other bits than b0. If the result after this operation is 0, then 'x' was even because bit b0 was 0. Otherwise 'x' was odd.

Let's look at some examples. Let's take integer 43, which is odd. In binary 43 is 00101011. Notice that the least significant bit b0 is 1 (in bold). Now let's AND it with 1:

```    00101011
&   00000001   (note: 1 is the same as 00000001)
--------
00000001
```

See how AND-ing erased all the higher order bits b1-b7 but left bit b0 the same it was? The result is thus 1 which tells us that the integer was odd.

Now let's look at -43. Just as a reminder, a quick way to find negative of a given number in two's complement representation is to invert all bits and add one. So -43 is 11010101 in binary. Again notice that the last bit is 1, and the integer is odd. (Note that if we used one's complement it wouldn't be true!)

Now let's take a look at an even integer 98. In binary 98 is 1100010.

```    01100010
&   00000001
--------
00000000
```

After AND-ing the result is 0. It means that the bit b0 of original integer 98 was 0. Thus the given integer is even.

Now the negative -98. It's 10011110. Again, bit b0 is 0, after AND-ing, the result is 0, meaning -98 is even, which indeed is true.

Bit Hack #2. Test if the n-th bit is set.

```if (x & (1<<n)) {
n-th bit is set
}
else {
n-th bit is not set
}
```

In the previous bit hack we saw that (x & 1) tests if the first bit is set. This bit hack improves this result and tests if n-th bit is set. It does it by shifting that first 1-bit n positions to the left and then doing the same AND operation, which eliminates all bits but n-th.

Here is what happens if you shift 1 several positions to the left:

```1         00000001    (same as 1<<0)
1<<1      00000010
1<<2      00000100
1<<3      00001000
1<<4      00010000
1<<5      00100000
1<<6      01000000
1<<7      10000000
```

Now if we AND 'x' with 1 shifted n positions to the left we effectively eliminate all the bits but n-th bit in 'x'. If the result after AND-ing is 0, then that bit must have been 0, otherwise that bit was set.

Let's look at some examples.

Does 122 have 3rd bit set? The operation we do to find it out is:

```122 & (1<<3)
```

Now, 122 is 01111010 in binary. And (1<<3) is 00001000.

```    01111010
&   00001000
--------
00001000
```

We see that the result is not 0, so yes, 122 has the 3rd bit set.

Note: In my article bit numeration starts with 0. So it's 0th bit, 1st bit, ..., 7th bit.

What about -33? Does it have the 5th bit set?

```    11011111      (-33 in binary)
&   00100000     (1<<5)
--------
00000000
```

Result is 0, so the 5th bit is not set.

Bit Hack #3. Set the n-th bit.

```y = x | (1<<n)
```

This bit hack combines the same (1<<n) trick of setting n-th bit by shifting with OR operation. The result of OR-ing a variable with a value that has n-th bit set is turning that n-th bit on. It's because OR-ing any value with 0 leaves the value the same; but OR-ing it with 1 changes it to 1 (if it wasn't already). Let's see how that works in action:

Suppose we have value 120, and we wish to turn on the 2nd bit.

```    01111000    (120 in binary)
|   00000100    (1<<2)
--------
01111100
```

What about -120 and 6th bit?

```    10001000   (-120 in binary)
|   01000000   (1<<6)
--------
11001000
```

Bit Hack #4. Unset the n-th bit.

```y = x & ~(1<<n)
```

The important part of this bithack is the ~(1<<n) trick. It turns on all the bits except n-th.

Here is how it looks:

```~1        11111110  (same as ~(1<<0))
~(1<<1)   11111101
~(1<<2)   11111011
~(1<<3)   11110111
~(1<<4)   11101111
~(1<<5)   11011111
~(1<<6)   10111111
~(1<<7)   01111111
```

The effect of AND-ing variable 'x' with this quantity is eliminating n-th bit. It does not matter if the n-th bit was 0 or 1, AND-ing it with 0 sets it to 0.

Here is an example. Let's unset 4th bit in 127:

```    01111111    (127 in binary)
&   11101111    (~(1<<4))
--------
01101111
```

Bit Hack #5. Toggle the n-th bit.

```y = x ^ (1<<n)
```

This bit hack also uses the wonderful "set n-th bit shift hack" but this time it XOR's it with the variable 'x'. The result of XOR-ing something with something else is that if both bits are the same, the result is 0, otherwise it's 1. How does it toggle n-th bit? Well, if n-th bit was 1, then XOR-ing it with 1 changes it to 0; conversely, if it was 0, then XOR-ing with with 1 changes it to 1. See, the bit got flipped.

Here is an example. Suppose you want to toggle 5th bit in value 01110101:

```    01110101
^   00100000
--------
01010101
```

What about the same value but 5th bit originally 0?

```    01010101
^   00100000
--------
01110101
```

Notice something? XOR-ing the same bit twice returned it to the same value. This nifty XOR property is used in calculating parity in RAID arrays and used in simple cryptography cyphers, but more about that in some other article.

Bit Hack #6. Turn off the rightmost 1-bit.

```y = x & (x-1)
```

Now it finally gets more interesting!!! Bit hacks #1 - #5 were kind of boring to be honest.

This bit hack turns off the rightmost one-bit. For example, given an integer 00101010 (the rightmost 1-bit in bold) it turns it into 00101000. Or given 00010000 it turns it into 0, as there is just a single 1-bit.

Here are more examples:

```    01010111    (x)
&   01010110    (x-1)
--------
01010110

01011000    (x)
&   01010111    (x-1)
--------
01010000

10000000    (x = -128)
&   01111111    (x-1 = 127 (with overflow))
--------
00000000

11111111    (x = all bits 1)
&   11111110    (x-1)
--------
11111110

00000000    (x = no rightmost 1-bits)
&   11111111    (x-1)
--------
00000000
```

Why does it work?

If you look at the examples and think for a while, you'll realize that there are two possible scenarios:

1. The value has the rightmost 1 bit. In this case subtracting one from it sets all the lower bits to one and changes that rightmost bit to 0 (so that if you add one now, you get the original value back). This step has masked out the rightmost 1-bit and now AND-ing it with the original value zeroes that rightmost 1-bit out.

2. The value has no rightmost 1 bit (all 0). In this case subtracting one underflows the value (as it's signed) and sets all bits to 1. AND-ing all zeroes with all ones produces 0.

Bit Hack #7. Isolate the rightmost 1-bit.

```y = x & (-x)
```

This bit hack finds the rightmost 1-bit and sets all the other bits to 0. The end result has only that one rightmost 1-bit set. For example, 01010100 (rightmost bit in bold) gets turned into 00000100.

Here are some more examples:

```    10111100  (x)
&   01000100  (-x)
--------
00000100

01110000  (x)
&   10010000  (-x)
--------
00010000

00000001  (x)
&   11111111  (-x)
--------
00000001

10000000  (x = -128)
&   10000000  (-x = -128)
--------
10000000

11111111  (x = all bits one)
&   00000001  (-x)
--------
00000001

00000000  (x = all bits 0, no rightmost 1-bit)
&   00000000  (-x)
--------
00000000
```

This bit hack works because of two's complement. In two's complement system -x is the same as ~x+1. Now let's examine the two possible cases:

1. There is a rightmost 1-bit bi. In this case let's pivot on this bit and divide all other bits into two flanks - bits to the right and bits to the left. Remember that all the bits to the right bi-1, bi-2 ... b0 are 0's (because bi was the rightmost 1-bit). And bits to the left are the way they are. Let's call them bi+1, ..., bn.

Now, when we calculate -x, we first do ~x which turns bit bi into 0, bits bi-1 ... b0 into 1s, and inverts bits bi+1, ..., bn, and then we add 1 to this result.

Since bits bi-1 ... b0 are all 1's, adding one makes them carry this one all the way to bit bi, which is the first zero bit.

If we put it all together, the result of calculating -x is that bits bi+1, ..., bn get inverted, bit bi stays the same, and bits bi-1, ..., b0 are all 0's.

Now, AND-ing x with -x makes bits bi+1, ..., bn all 0, leaves bit bi as is, and sets bits bi-1, ..., b0 to 0. Only one bit is left, it's the bit bi - the rightmost 1-bit.

2. There is no rightmost 1-bit. The value is 0. The negative of 0 in two's complement is also 0. 0&0 = 0. No bits get turned on.

We have proved rigorously that this bithack is correct.

Bit Hack #8. Right propagate the rightmost 1-bit.

```y = x | (x-1)
```

This is best understood by an example. Given a value 01010000 it turns it into 01011111. All the 0-bits right to the rightmost 1-bit got turned into ones.

This is not a clean hack, tho, as it produces all 1's if x = 0.

Let's look at more examples:

```    10111100  (x)
|   10111011  (x-1)
--------
10111111

01110111  (x)
|   01110110  (x-1)
--------
01110111

00000001  (x)
|   00000000  (x-1)
--------
00000001

10000000  (x = -128)
|   01111111  (x-1 = 127)
--------
11111111

11111111  (x = -1)
|   11111110  (x-1 = -2)
--------
11111111

00000000  (x)
|   11111111  (x-1)
--------
11111111
```

Let's prove it, though not as rigorously as in the previous bithack (as it's too time consuming and this is not a scientific publication). There are two cases again. Let's start with easiest first.

1. There is no rightmost 1-bit. In that case x = 0 and x-1 is -1. -1 in two's complement is 11111111. OR-ing 0 with 11111111 produces the same 11111111. (Not the desired result, but that's the way it is.)

2. There is the rightmost 1-bit bi. Let's divide all the bits in two groups again (like in the previous example). Calculating x-1 modifies only bits to the right, turning bi into 0, and all the lower bits to 1's. Now OR-ing x with x-1 leaves all the higher bits (to the left) the same, leaves bit bi as it was 1, and since lower bits are all low 1's it also turns them on. The result is that the rightmost 1-bit got propagated to lower order bits.

Bit Hack #9. Isolate the rightmost 0-bit.

```y = ~x & (x+1)
```

This bithack does the opposite of #7. It finds the rightmost 0-bit, turns off all bits, and sets this bit to 1 in the result. For example, it finds the zero in bold in this number 10101011, producing 00000100.

More examples:

```    10111100  (x)
--------
01000011  (~x)
&   10111101  (x+1)
--------
00000001

01110111  (x)
--------
10001000  (~x)
&   01111000  (x+1)
--------
00001000

00000001  (x)
--------
11111110  (~x)
&   00000010  (x+1)
--------
00000010

10000000  (x = -128)
--------
01111111  (~x)
&   10000001  (x+1)
--------
00000001

11111111  (x = no rightmost 0-bit)
--------
00000000  (~x)
&   00000000  (x+1)
--------
00000000

00000000  (x)
--------
11111111  (~x)
&   00000001  (x+1)
--------
00000001
```

Proof: Suppose there is a rightmost 0-bit. Then ~x turns this rightmost 0 bit into 1 bit. And so does x+1 (because bits more right to the rightmost 0 bit are 1's). Now AND-ing ~x with x+1 evaporates all the bits up to this rightmost 0 bit. This is the highest order bit set in the result. Now what about lower order bits to the right of rightmost 0 bit? They also got evaporated because because x+1 turned them into 0's (they were 1's) and ~x turned them into 0's. They got AND-ed with 0 and evaporated.

Bit Hack #10. Turn on the rightmost 0-bit.

```y = x | (x+1)
```

This hack changes the rightmost 0-bit into 1. For example, given an integer 10100011 it turns it into 10100111.

More examples:

```    10111100  (x)
|   10111101  (x+1)
--------
10111101

01110111  (x)
|   01111000  (x+1)
--------
01111111

00000001  (x)
|   00000010  (x+1)
--------
00000011

10000000  (x = -128)
|   10000001  (x+1)
--------
10000001

11111111  (x = no rightmost 0-bit)
|   00000000  (x+1)
--------
11111111

00000000  (x)
|   00000001  (x+1)
--------
00000001
```

Here is the proof as a bunch of true statements. OR-ing x with x+1 does not lose any information. Adding 1 to x fills the first rightmost 0. The result is max{x, x+1}. If x+1 overflows it's x and there were no 0 bits. If it doesn't, it's x+1 which just got rightmost bit filled with 1.

## Bonus stuff.

If you decide to play more with these hacks, here are a few utility functions to print binary values of 8 bit signed integers in Perl, Python and C.

Print binary representation in Perl:

```sub int_to_bin {
my \$num = shift;
print unpack "B8", pack "c", \$num;
}
```

Or you can print it from command line right away:

```perl -wle 'print unpack "B8", pack "c", shift' <integer>

# For example:
perl -wle 'print unpack "B8", pack "c", shift' 113
01110001

perl -wle 'print unpack "B8", pack "c", shift' -- -128
10000000
```

Print binary number in Python:

```def int_to_bin(num, bits=8):
r = ''
while bits:
r = ('1' if num&1 else '0') + r
bits = bits - 1
num = num >> 1
print r
```

Print binary representation in C:

```void int_to_bin(int num) {
char str[9] = {0};
int i;
for (i=7; i>=0; i--) {
str[i] = (num&1)?'1':'0';
num >>= 1;
}
printf("%s\n", str);
}
```

Have fun with these! I'll write about advanced bit hacks some time soon. If you are really intrigued by this topic I encourage you to subscribe to my blog. Thanks!

Ps. Let me know in the comments what you think about this article, and let me know if you do not know what two's complement, or the basic binary operations are. If there are a few people who would like me to explain these concepts, I'll be glad to write another article just about these fundamental topics.

Pps. There is a book entirely on bit hacks like these. It's called "Hacker's Delight". It may be worth getting if you are into this stuff.

Misc May 15, 2009

# How I Got Hired by Plurk.com

Hi everyone! I have awesome news! I have been hired by plurk.com! It's actually been almost two months since I am with Plurk and so far it has been great. It's an interesting story on how I got hired and what I do there. I'll tell about it in more details in this post. I got hired thanks to this blog!

But before I do that, let me briefly explain what Plurk is about. Plurk is a startup that focuses on online conversations. It's similar to Twitter, but done completely differently and if I may say it's done right. You send a message to a friend, come back after five hours and you can continue the conversation right there. Others can join the conversation at any time.

This is how Plurk looks:

I just answered to user's shlomif plurk about his new Perl module.

I think it was a short and sweet explanation of what Plurk is. Visit www.plurk.com to try it, and add me as your friend to start having real fun on Plurk! My nickname on Plurk is "pkrumins".

Now let's move on to the essence of the article:

## How I Got Hired by Plurk.com

I first heard about Plurk on July 26th, exactly one month after I graduated. Kan, the founder of Plurk, had found my article on MySQL Tuning and decided to contact me. He had browsed my blog and he liked the good melange of different things that I posted about.

The way he approached me was completely different than of any other people interested in talking with me or hiring me. First thing that surprised me was that he found me via Skype and Google Talk at the same time. He probably did that to make sure he gets in touch with me. I had never given out my Skype username/gtalk email (but it's not hard to find), yet Kan took the time to find it. The second thing that surprised me was that he had also researched on me and found things like my polyphasic sleep experiment, which we discussed at length in one of our conversations.

With other people the discussions would usually take place over email, we'd send each other a few emails and then the discussion would die off, because I was always busy and I did not have much time to go into long discussions over email. The email conversations would usually end with "let's keep in touch".

With Kan it all took a different route. In our first conversation over Skype I explained that I was very happy that I had just graduated and that I had made a list of things to do on my own, and was not very interested in working for anyone. He said it was ok and that he just wanted to chit-chat and see what I was up to. Our first chat session lasted just one hour, but we managed to talk about a dozen of different things -- we talked about my current activities and my future plans on how I want to be a great hacker. Then we talked about Plurk's success, how it had become 5th fastest growing website on the net, and how it was praised to be the best communication service on the web. We also chatted about one of world's top hackers, and how useless the Brainbench certificates are (I had 52 of them), we discussed the people who work at Plurk and their accomplishments.

We'd continue chatting like this every few weeks, sometimes even day after day. Kan would always tell me something interesting. One time we'd talk about great generalist hackers like Max Levchin, Bob Ippolito and Gabriel Weinberg. The next time we'd talk how all the Plurk employees telecommute and how other companies do it. Then we'd talk about crazy Silicon Valley people who use startup drugs like Provigil. I could keep on going here forever.

I got more and more intrigued about Plurk but then, suddenly, one day Google sent me an email that they were interesting in hiring me. I told Kan about it but he was not very disappointed. Instead, our discussions shifted towards Google. Well, long story short, I did not get hired by Google and Kan had hypnotized me with his wisdom so I agreed to work at Plurk. Plus he wrote this:

```[2008.09.07 08:41:15] Kan says: anyway, in the unlikely event Google
turns you down, just name your price and you have it. (or take googles
offer and multiply by 1.5) Best of luck :)
```

Can't turn down loads of cash. And so I got hired on January 15th -- almost 6 months later!

The lesson that I learned from Kan's method of hiring is to keep interest in the person you want to hire and occasionally talk with him/her, ask what he's up to, and tell some interesting things that you are expert in. If I ever gonna create my own startup, I'll use Kan's method of hiring.

## Job Interview

Before getting hired I had a job interview at Plurk. It was done in a group chat with Kan and the previous Plurk's sysadmin Dima. The interview was more sysadmin oriented and it took 2.5 hours total.

Here are the job interview questions:

• What OS'es and distributions are you experienced with?
• If you have access to a Linux server, how would you identify if it's a 32bit or a 64bit system?
• What process in Linux has ID equal 1?
• What can you tell about /proc in Linux?
• What is forking?
• What are zombie processes?
• Do zombie processes use system resources?
• How would look at what a particular process is doing right now?
• If you have a process in D state, how would you kill it?
• How do you install software in general?
• How would you rebuilt a debian package?
• What if you have a server with a long uptime, 1 year let's say, and it has lots of files on a huge partition, and you need to reboot it. How would you make sure it doesn't check the filesystem on reboot?
• How does traceroute work?
• What IP address can be the default gateway for host 192.168.1.82/24?
• What is the bridge interface?
• What is the default password of MySQL?
• How does the MySQL client connect when the target host is localhost?
• How would you properly manage the deletion of old binary log files in MySQL?
• How would you list current bin logs and delete only few of them properly?
• How would you adjust the size and the interval after which old logs are purged?
• What is the difference between 'set <var>' and 'set global <var>'?
• Sometimes MySQL will throw an error showing only some numerical value, how to check what does the number mean?
• How to effectively increase the maximum number of tables MySQL can have open at one time?
• If someone set the password for a 'root' account in mysql and no one knows it, how would you recover it?
• How would you backup MySQL?
• If you have a mysqldump utility and you have to dump a database consisting of InnoDB tables, large like 20Gb, how would you do that?
• How would you back up a running MySQL instance that is being used by application, if MySQL data files are on the LVM volume and you need to make sure application is functional during the backup process?
• What if you have to copy the data from MySQL master server to slave first, what is important step there?
• What functionality does LVM snapshot provide?
• What kind of RAID do you know?
• What is there any difference between hardware and software RAID?
• Which utility is used to manage software RAID in Linux?
• What hardware RAID controllers have you dealt with?
• Have you ever heard of BBU?
• How to safely upgrade kernel remotely if you are not 100% sure that new kernel will boot successfully?
• Have you ever used IPMI cards?

At this point I was totally tired and we ended the interview here. I got most of them right, except some MySQL questions, some LVM questions and some RAID questions. I did not feel well about not knowing answers to those, but after the interview Kan told me that he enjoyed the interview a lot. I was happy. :)

## What do I do at Plurk?

My official job title is "hacker extraordinaire" and I am allowed to hack on anything that interests me at Plurk. So far I have managed to hack mostly on two things, but I have also helped with several smaller sysadmin things.

My first task at Plurk was to expand the collaborative translation platform that Plurk already had. Plurk has hundreds of thousands of users from all over the world and the collective intelligence they create can be used in many fascinating ways. One of the ways is to use their help to translate Plurk more dynamically. Instead of just asking users to translate new static strings, the system I created asks users to translate them on the fly. It can be used, for example, to translate system messages. Before a system message is sent to users, it's first sent to translators, and only after they translated it, it's sent off to all the users. If you are on Plurk and would like to help us with translations, visit "Plurk Collaborative Translation Project" to join!

While doing this project, I learned a bunch of new tools like ExtJS, Mako templates and Werkzeug.

My current task is to implement a real-time search engine for Plurk. As a great coder who reuses software, I am using a combination of Tokyo Dystopia and Sphinx to implement it. I am allowed to blog about the things I do, so expect a few articles on real-time search from me soon!

I have also prepared an article on a tool that I wrote that splits mysqldump into tables and imports N tables in parallel (it's quicker on multiprocessor machine to import several tables in parallel in mysql).

Please subscribe to my blog to get all these tasty posts automatically!

Btw, the article about pipe viewer utility originated from me helping Plurk's sysadmin to backup a 500GB database.

## People at Plurk

We are just six that run Plurk. We all work remotely and each one of us is in a different country.

• Kan is the founder and overlord of Plurk and he lives in Canada.
• Alvin is a co-founder of Plurk, and the UI and design guy. He currently resides in Taiwan.
• Amir is also a co-founder of Plurk and he works as the lead developer of Plurk. He's originally from Bosnia but now lives in Denmark. He has the most elegant todo manager on the web todoist. Check it out!
• Gleber is Erlang ninja. His home is Poland.
• Ryan is our beloved sysadmin from USA.

We are all very friendly and help out each other a lot. When we have some serious issues we all come to a group chat on IRC to solve them together. Otherwise we communicate via private plurks and Skype.

## Open Source at Plurk

All of the Plurk is built using open source software -- Linux, MySQL, Python, Nginx, Haproxy, Nagios, Cacti, Memcached, Memcachedb, Tokyo Tyrant, and these are just the big names that I can remember off my head.

As a result we also want to give back to open source community and we have set up Plurk dev labs:

We have released our first project there called "LightCloud", that builds on top of Tokyo Tyrant to make it scale horizontally.

## What Do People Think About Plurk? A Case Study.

After a few weeks at Plurk, I started inviting many of my FreeNode #perl friends to Plurk. Some of them were very skeptical and said things like "eww, another social website". Still, I somehow managed to get them on and suddenly their view changed. Here is what they said about Plurk after trying it for a couple of hours:

Shlomi Fish, a well known Israeli hacker:

```[18:12:01] <pkrumins> how do you like plurk?
[18:12:46] <rindolf> So far, it's great.
[18:12:51] <rindolf> I think I'm addicted.

# click here for his plurk profile: <a href="http://www.plurk.com/shlomif">shlomif</a>
```

Kent Fredric, a Perl hacker wannabee from New Zealand:

```[17:29:29] <pkrumins> so plurk seems nice?
[17:29:54] <kent\n> definitely, its probably the first
web-based chat medium which makes any sense

# click here for his plurk profile: <a href="http://www.plurk.com/kentfredric">kentfredric</a>
```

f00li5h, a Perl overlord from Australia:

```[18:24:16] <pkrumins> like plurks?
[18:24:44] <f00li5h> better than twitters

# click here for his plurk profile: <a href="http://www.plurk.com/f00li5h">f00li5h</a>
```

Overall we are having tremendous fun at Plurk, all the Perl guys are schmoozing together, discussing everything about programming, cats and turtles. Come join us: Altreus, Caelum, iank, jawnsy, simcop2387, SubStack, whoppix and apeiron.

Here is a widget of my most recent Plurks:

Finally, if you are interested in working with me and other great hackers at Plurk, see the Jobs page and send your cv to jobs@plurk.com or kan@plurk.com!

Happy times! =\$_\$=

Projects April 15, 2009

# Python Library for Searching Adwords

Here is another quick hack that I wrote a while ago. It complements the xgoogle library that I published in my previous post with an API for Google Sponsored Links search.

Let me quickly explain why this library is useful, and what the Google Sponsored Links are.

For a typical search, Google shows regular web search results on the left side of the page, and "Sponsored Links" in a column on the right side. "Sponsored" means the results are pulled from Googe's advertising network (Adwords).

Here is a screenshot that illustrates the Sponsored Links:

Google Sponsored Links results for search term "security" are in red.

Okay, now why would I need a library to search the Sponsored results? Suppose that I am an advertiser on Adwords, and I buy some software related keywords like "video software". It is in my interests to know my competitors, their advertisement text, what are they up to, the new players in this niche, and their websites. Without my library it would be practically impossible to keep track of all the competitors. There can literally be hundreds of changes per day. However, with my library it's now piece of cake to keep track of all the dynamics.

## How does the library work?

The sponsored links library pulls the results from this URL: http://www.google.com/sponsoredlinks. Here is an example of all the sponsored results for a query "security":

Sponsored links results for "security".

The library just grabs page after page, calls BeautifulSoup, and extracts the search result elements. Elementary.

## How to use the library?

As I mentioned, this library is part of my xgoogle library. Download and extract it first:

Download: xgoogle library (.zip)
Downloaded: 33312 times.
Download url: http://www.catonmat.net/download/xgoogle.zip

Now, the source file that contains the implementation of this library is "xgoogle/sponsoredlinks.py". To use it, do the usual import "from xgoogle.sponsoredlinks import SponsoredLinks, SLError".

SponsoredLinks is the class that provides the API and SLError is exception class that gets thrown in case of errors, so it's a good idea to import both.

The SponsoredLinks has a similar interface as the xgoogle.search (the plain google search module). The constructor of SponsoredLinks takes the keyword you want to search for, and the constructed object has several public methods and properties:

• method get_results() - gets a page of results, returning a list of SponsoredLink objects. It returns an empty list if there are no more results.
• property num_results - returns number of search results found.
• property results_per_page - sets/gets the number of results to get per page (max 100).

The returned SponsoredLink objects have four attributes -- "title", "desc", "url", and "display_url". Here is a picture that illustrates what each attribute stands for:

The picture does not show the "display_url" attribute as it's the actual link the result links to (href of blue link in the pic).

Here is an example usage of this library. It retrieves first 100 Sponsored Links results for keyword "video software":

```from xgoogle.sponsoredlinks import SponsoredLinks, SLError
try:
sl = SponsoredLinks("video software")
sl.results_per_page = 100
results = sl.get_results()
except SLError, e:
print "Search failed: %s" % e

for result in results:
print result.title.encode('utf8')
print result.desc.encode('utf8')
print result.display_url.encode('utf8')
print result.url.encode('utf8')
print
```

Output:

```Photoshop Video Software
Time saving software for video. Work faster in Photoshop.
www.toolsfortelevision.com
http://www.toolsfortelevision.com

...
```

That's about it for this time. Use it to find your competitors and outsmart them!

Next time I am going to expand the library for Google Sets search.

Download "xgoogle" library:

Download: xgoogle library (.zip)
Downloaded: 33312 times.
Download url: http://www.catonmat.net/download/xgoogle.zip

Have fun!

Projects March 12, 2009

# Python Library for Google Search

Here is a quick hack that I wrote. It's a Python library to search Google without using their API. It's quick and dirty, just the way I love it.

Why didn't I use Google's provided REST API? Because it says "you can only get up to 8 results in a single call and you can't go beyond the first 32 results". Seriously, what am I gonna do with just 32 results?

I wrote it because I want to do various Google hacks automatically, monitor popularity of some keywords and sites, and to use it for various other reasons.

One of my next post is going to extend on this library and build a tool that perfects your English. I have been using Google for a while to find the correct use of various English idioms, phrases, and grammar. For example, "i am programmer" vs. "i am a programmer". The first one is missing an indefinite article "a", but the second is correct. Googling for these terms reveal that the first has 6,230 results, but the second has 136,000 results, so I pretty much trust that the 2nd is more correct than the first.

Subscribe to my posts via catonmat's rss, if you are intrigued and would love to receive my posts automatically!

## How to use the library?

First download the xgoogle library, and extract it somewhere.

Download: xgoogle library (.zip)
Downloaded: 33312 times.
Download url: http://www.catonmat.net/download/xgoogle.zip

At the moment it contains just the code for Google search, but in the future I will add other searches (google sets, google suggest, etc).

To use the search, from "xgoogle.search" import "GoogleSearch" and, optionally, "SearchError".

GoogleSearch is the class you will use to do Google searches. SearchError is an exception class that GoogleSearch throws in case of various errors.

Pass the keyword you want to search as the first parameter to GoogleSearch's constructor. The constructed object has several public methods and properties:

• method get_results() - gets a page of results, returning a list of SearchResult objects. It returns an empty list if there are no more results.
• property num_results - returns number of search results found.
• property results_per_page - sets/gets the number of results to get per page. Possible values are 10, 25, 50, 100.
• property page - sets/gets the search page.

As I said, get_results() method returns a SearchResult object. It has three attributes -- "title", "desc", and "url". They are Unicode strings, so do a proper encoding before outputting them.

Here is a screenshot that illustrates the "title", "desc", and "url" attributes:

Google search result for "catonmat".

Here is an example program of doing a Google search. It takes the first argument, does a search on it, and prints the results:

```from xgoogle.search import GoogleSearch, SearchError
try:
gs = GoogleSearch("quick and dirty")
gs.results_per_page = 50
results = gs.get_results()
for res in results:
print res.title.encode("utf8")
print res.desc.encode("utf8")
print res.url.encode("utf8")
print
except SearchError, e:
print "Search failed: %s" % e
```

This code fragment sets up a search for "quick and dirty" and specifies that a result page should have 50 results. Then it calls get_results() to get a page of results. Finally it prints the title, description and url of each search result.

Here is the output from running this program:

```Quick-and-dirty - Wikipedia, the free encyclopedia
Quick-and-dirty is a term used in reference to anything that is an easy way to implement a kludge. Its usage is popular among programmers, ...
http://en.wikipedia.org/wiki/Quick-and-dirty

Grammar Girl's Quick and Dirty Tips for Better Writing - Wikipedia ...
"Grammar Girl's Quick and Dirty Tips for Better Writing" is an educational podcast that was launched in July 2006 and the title of a print book that was ...Writing - 39k -
http://en.wikipedia.org/wiki/Grammar_Girl%27s_Quick_and_Dirty_Tips_for_Better_Writing

Quick & Dirty Tips :: Grammar  Girl
Quick & Dirty Tips(tm) and related trademarks appearing on this website are the property of Mignon Fogarty, Inc. and Holtzbrinck Publishers Holdings, LLC. ...
http://grammar.quickanddirtytips.com/
[...]
```

Compare these results to the output above.

You could also have specified which search page to start the search from. For example, the following code will get 25 results per page and start the search at 2nd page.

```gs = GoogleSearch("quick and dirty")
gs.results_per_page = 25
gs.page = 2
results = gs.get_results()
```

You can also quickly write a scraper to get all the results for a given search term:

```from xgoogle.search import GoogleSearch, SearchError
try:
gs = GoogleSearch("quantum mechanics")
gs.results_per_page = 100
results = []
while True:
tmp = gs.get_results()
if not tmp: # no more results were found
break
results.extend(tmp)
# ... do something with all the results ...
except SearchError, e:
print "Search failed: %s" % e
```

You can use this library to constantly monitor how your website is ranking for a given search term. Suppose your website has a domain "catonmat.net" and the search term you want to find your position for is "python videos".

Here is a code that outputs your ranking: (it looks through first 100 results, if you need more, put a loop there)

```import re
from urlparse import urlparse
from xgoogle.search import GoogleSearch, SearchError

target_domain = "catonmat.net"
target_keyword = "python videos"

def mk_nice_domain(domain):
"""
convert domain into a nicer one (eg. www3.google.com into google.com)
"""
domain = re.sub("^www(\d+)?\.", "", domain)
# add more here
return domain

gs = GoogleSearch(target_keyword)
gs.results_per_page = 100
results = gs.get_results()
for idx, res in enumerate(results):
parsed = urlparse(res.url)
domain = mk_nice_domain(parsed.netloc)
if domain == target_domain:
print "Ranking position %d for keyword '%s' on domain %s" % (idx+1, target_keyword, target_domain)
```

Output of this program:

```Ranking position 6 for keyword python videos on domain catonmat.net
Ranking position 7 for keyword python videos on domain catonmat.net
```

Here is a much wicked example. It uses the GeoIP Python module to find all 10 websites for keyword "wicked code" that are physically hosting in California or New York in USA. Make sure you download GeoCityLite database from "http://www.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz" and extract it to "/usr/local/geo_ip".

```import GeoIP
from urlparse import urlparse
from xgoogle.search import GoogleSearch, SearchError

class Geo(object):
GEO_PATH = "/usr/local/geo_ip/GeoLiteCity.dat"

def __init__(self):
self.geo = GeoIP.open(Geo.GEO_PATH, GeoIP.GEOIP_STANDARD)

def detect_by_host(self, host):
try:
gir = self.geo.record_by_name(host)
return {'country': gir['country_code'].lower(),
'region': gir['region'].lower()}
except Exception, e:
return {'country': 'none', 'region': 'none'}

dst_country = 'us'
dst_states = ['ca', 'ny']
dst_keyword = "wicked code"
num_results = 10
final_results = []
geo = Geo()

gs = GoogleSearch(dst_keyword)
gs.results_per_page = 100

seen_websites = []
while len(final_results) < num_results:
results = gs.get_results()
domains = [urlparse(r.url).netloc for r in results]
for d in domains:
geo_loc = geo.detect_by_host(d)
if (geo_loc['country'] == dst_country and
geo_loc['region'] in dst_states and
d not in seen_websites):
final_results.append((d, geo_loc['region']))
seen_websites.append(d)
if len(final_results) == num_results:
break

print "Found %d websites:" % len(final_results)
for w in final_results:
print "%s (state: %s)" % w
```

Here is the output of running it:

```Found 10 websites:
www.wickedcode.com (state: ca)
www.retailmenot.com (state: ca)
www.simplyhired.com (state: ca)
archdipesh.blogspot.com (state: ca)
wagnerblog.com (state: ca)
answers.yahoo.com (state: ca)
devsnippets.com (state: ca)
friendfeed.com (state: ca)
www.thedacs.com (state: ny)
www.tipsdotnet.com (state: ca)
```

You may modify these examples the way you wish. I'd love to hear some comments about what you can come up with!

And just for fun, here are some other simple uses:

You can make your own Google Fight:

```import sys
from xgoogle.search import GoogleSearch, SearchError

args = sys.argv[1:]
if len(args) < 2:
print 'Usage: google_fight.py "keyword 1" "keyword 2"'
sys.exit(1)

try:
n0 = GoogleSearch('"%s"' % args[0]).num_results
n1 = GoogleSearch('"%s"' % args[1]).num_results
except SearchError, e:
print "Google search failed: %s" % e
sys.exit(1)

if n0 > n1:
print "%s wins with %d results! (%s had %d)" % (args[0], n0, args[1], n1)
elif n1 > n0:
print "%s wins with %d results! (%s had %d)" % (args[1], n1, args[0], n0)
else:
print "It's a tie! Both keywords have %d results!" % n1
```

Download: google_fight.py
Downloaded: 7938 times.
Download url: http://www.catonmat.net/download/google_fight.py

Here is an example usage of google_fight.py:

```\$ ./google_fight.py google microsoft
google wins with 2680000000 results! (microsoft had 664000000)

\$ ./google_fight.py "linux ubuntu" "linux gentoo"
linux ubuntu wins with 4300000 results! (linux gentoo had 863000)
```

After I wrote this, I generalized this Google Fight to take N keywords, and made their passing to program easier by allowing them to be separated by a comma.

```import sys
from operator import itemgetter
from xgoogle.search import GoogleSearch, SearchError

args = sys.argv[1:]
if not args:
print "Usage: google_fight.py keyword one, keyword two, ..."
sys.exit(1)

keywords = [k.strip() for k in ' '.join(args).split(',')]
try:
results = [(k, GoogleSearch('"%s"' % k).num_results) for k in keywords]
except SearchError, e:
print "Google search failed: %s" % e
sys.exit(1)

results.sort(key=itemgetter(1), reverse=True)
for res in results:
print "%s: %d" % res
```

Download: google_fight2.py
Downloaded: 8138 times.
Download url: http://www.catonmat.net/download/google_fight2.py

Here is an example usage of google_fight2.py:

```\$ ./google_fight2.py earth atmospehere, sun atmosphere, moon atmosphere, jupiter atmosphere
earth atmospehere: 685000
jupiter atmosphere: 31400
sun atmosphere: 24900
moon atmosphere: 8130
```

I am going to expand on this library and add search for Google Sets, Google Sponsored Links, Google Suggest, and perhaps some other Google searches. Then I'm going to build various tools on them, like a sponsored links competitor finder, use Google Suggest together with Google Sets to find various phrases in English, and apply them to tens of other my ideas.

Download "xgoogle" library and examples:

Download: xgoogle library (.zip)
Downloaded: 33312 times.
Download url: http://www.catonmat.net/download/xgoogle.zip

Download: google_fight.py
Downloaded: 7938 times.
Download url: http://www.catonmat.net/download/google_fight.py

Download: google_fight2.py
Downloaded: 8138 times.
Download url: http://www.catonmat.net/download/google_fight2.py