As everyone already knows, Google released a new open-source web browser called Chrome.

Having interest in code reuse, I downloaded the source code and examined all the open-source libraries used.

The Google Chrome browser is an excellent example of code reuse. I found that it uses at least 25 different software libraries!

Here is the full list of libraries, along with relative paths to the source code and short library descriptions. Many of the libraries have been patched by Googlers; look for README.google files in each library's directory for information about the changes.

  • Google Breakpad (/src/breakpad): An open-source multi-platform crash reporting system.
  • Google URL (/src/googleurl): A small library for parsing and canonicalizing URLs.
  • Skia (/src/skia): Vector graphics engine.
  • Google V8 (/src/v8): Google's open source JavaScript engine. V8 implements ECMAScript as specified in ECMA-262, 3rd edition, and runs on Windows XP and Vista, Mac OS X 10.5 (Leopard), and Linux systems that use IA-32 or ARM processors. V8 can run standalone, or can be embedded into any C++ application.
  • WebKit (/src/webkit): Open source web browser engine.
  • Netscape Portable Runtime (NSPR) (/src/base/third_party/nspr): Provides a platform-neutral API for system-level and libc-like functions.
  • Network Security Services (NSS) (/src/base/third_party/nss): A set of libraries designed to support cross-platform development of security-enabled client and server applications. Applications built with NSS can support SSL v2 and v3, TLS, PKCS #5, PKCS #7, PKCS #11, PKCS #12, S/MIME, X.509 v3 certificates, and other security standards.
  • Hunspell (/src/chrome/third_party/hunspell): Spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding.
  • Windows Template Library (/src/chrome/third_party/wtl): C++ library for developing Windows applications and UI components. It extends ATL (Active Template Library) and provides a set of classes for controls, dialogs, frame windows, GDI objects, and more.
  • Google C++ Testing Framework (/src/testing/gtest): Google's framework for writing C++ tests on a variety of platforms (Linux, Mac OS X, Windows, Windows CE, and Symbian). Based on the xUnit architecture. Supports automatic test discovery, a rich set of assertions, user-defined assertions, death tests, fatal and non-fatal failures, various options for running the tests, and XML test report generation.
  • bsdiff and bspatch (/src/third_party/bsdiff and /src/third_party/bspatch): Tools for building and applying patches to binary files.
  • bzip2 (/src/third_party/bzip2): bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm and Huffman coding.
  • International Components for Unicode (ICU) (/src/third_party/icu38): A mature, widely used set of C/C++ and Java libraries providing Unicode and globalization support for software applications.
  • libjpeg (/src/third_party/libjpeg): Library for handling the JPEG (JFIF) image format.
  • libpng (/src/third_party/libpng): PNG image format library. It supports almost all PNG features, is extensible, and has been extensively tested for over 13 years.
  • libxml (/src/third_party/libxml): XML C parsing library.
  • libxslt (/src/third_party/libxslt): XSLT C library.
  • LZMA (/src/third_party/lzma_sdk): LZMA is the default and general compression method of the 7z format in the 7-Zip program.
  • stringencoders (/src/third_party/modp_b64): A collection of high-performance C-string transformations (in this case, base64 encoding/decoding), frequently 2x faster than standard implementations (if they exist at all).
  • Netscape Plugin Application Programming Interface (NPAPI) (/src/third_party/npapi): Cross-platform plugin architecture used by many web browsers.
  • Pthreads-w32 (/src/third_party/pthread): Application programming interface (API) for writing multithreaded applications.
  • SCons (/src/third_party/scons): Open source software construction tool -- that is, a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache.
  • sqlite (/src/third_party/sqlite): Software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.
  • TLS Lite (/src/third_party/tlslite): Free Python library that implements SSL 3.0, TLS 1.0, and TLS 1.1. TLS Lite supports non-traditional authentication methods such as SRP, shared keys, and cryptoIDs in addition to X.509 certificates. (Note: Python is not a part of Chrome; it is used for testing various parts of the browser, such as measuring code coverage, checking dependencies, measuring page load times, comparing generated HTML, etc.)
  • zlib (/src/third_party/zlib): zlib is designed to be a free, general-purpose, legally unencumbered -- that is, not covered by any patents -- lossless data-compression library for use on virtually any computer hardware and operating system.

They have done a really good job making these libraries work together. As someone said, "good coders code, great reuse."

I also found some other exciting things in the source, which I will soon post about. I recommend that you subscribe to my RSS feed if you are interested!

Speaking of Chrome, I am waiting for Google to add the capability to write extensions for the browser! I have already made a list of extensions that I will try to create as soon as they add this feature.

This article is part of the article series "MIT Introduction to Algorithms."

This is the fifth post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lectures seven and eight, which are on the topic of Hashing.

Many applications require a dynamic set that supports dictionary operations insert, search, and delete. For example, a compiler for a computer language maintains a symbol table, in which the keys of elements are arbitrary strings that correspond to identifiers in the language. A hash table is an effective data structure for implementing dictionaries.

Lectures seven and eight cover various implementation techniques of hash tables and hash functions.

Lecture 7: Hashing I

Lecture seven starts with the symbol-table problem -- given a table S, we'd like to insert an element into S, delete an element from S and search for an element in S. We'd also like these operations to take constant time.

The simplest solution to this problem is to use a direct-access (or direct-address) table. To represent the dynamic set, we use an array, or direct-address table, denoted by T, in which each position, or slot, corresponds to a key.

Using a direct-address table, the dictionary operations are trivial to implement.

Direct-Address-Search(T, k)
    return T[k]

Direct-Address-Insert(T, x)
    T[key[x]] = x

Direct-Address-Delete(T, x)
    T[key[x]] = NIL

Direct addressing is applicable when we can afford to allocate an array that has one position for every possible key. It is not applicable when the range of keys can be large (as it requires a lot of space for the array T). This is where hashing comes in.

The lecture continues with explanation of what hashing is. Hashing uses a hash function h(k) that maps keys k randomly into slots of hash-table T. There is one hitch: two keys may hash to the same slot. We call this situation a collision. Fortunately, there are effective techniques for resolving the conflict created by collisions.

One of the simplest collision resolution techniques is called chaining. In chaining, we put all the elements that hash to the same slot in a linked list:

Collision resolution by chaining. Each hash-table slot T[j] contains a linked list of all the keys whose hash value is j. For example, h(k1) = h(k4).

Professor Leiserson then analyzes the running time of the insert, delete and search operations. It is concluded that the expected running time of these operations is still O(1), under the assumption that the number of hash-table slots is at least proportional to the number of elements in the table.
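
To make chaining concrete, here is a minimal Python sketch of a chained hash table. The class name, default slot count, and use of Python's built-in hash are my own illustrative choices, not from the lecture:

class ChainedHashTable:
    """Dictionary with collision resolution by chaining."""

    def __init__(self, num_slots=8):
        # one chain (Python list) per hash-table slot T[j]
        self.slots = [[] for _ in range(num_slots)]

    def _slot(self, key):
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:          # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        return None               # plays the role of NIL

    def delete(self, key):
        j = self._slot(key)
        self.slots[j] = [(k, v) for k, v in self.slots[j] if k != key]


t = ChainedHashTable()
t.insert("k1", 1)
t.insert("k4", 4)                 # may land in the same slot as "k1" -- that's a chain
print(t.search("k4"))             # -> 4
t.delete("k4")
print(t.search("k4"))             # -> None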

The other half of the lecture is devoted to hash functions and to another way of resolving collisions -- open addressing -- along with two probing (search) strategies for open addressing: linear probing and double hashing.

A good hash function should distribute the keys uniformly into the slots of the table, and regularity in the key distribution should not affect this uniformity.

Two methods for creating hash functions are introduced: the division method, which defines h(k) = k mod m, and the multiplication method, where h(k) = (A·k mod 2^w) >> (w - r), where w is the number of bits in a machine word, A is an odd integer with 2^(w-1) < A < 2^w, and r = lg(m) (i.e., the table size is m = 2^r).
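
To illustrate both methods, here is a small Python sketch; the word size w, the table-size exponent r, and the constant A below are values I picked for illustration:

def hash_division(k, m):
    # division method: h(k) = k mod m
    # (m is usually chosen to be a prime not too close to a power of 2)
    return k % m

def hash_multiplication(k, A, w=64, r=10):
    # multiplication method: h(k) = (A*k mod 2^w) >> (w - r)
    # A is an odd integer with 2^(w-1) < A < 2^w; the table size is m = 2^r
    return ((A * k) & ((1 << w) - 1)) >> (w - r)

A = 0x9E3779B97F4A7C15                    # an arbitrary odd 64-bit constant
print(hash_division(123456, 701))         # a slot in 0..700
print(hash_multiplication(123456, A))     # a slot in 0..1023 (m = 2^10)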

You're welcome to watch lecture seven:

Topics covered in lecture seven:

  • [00:30] Symbol-table problem.
  • [02:05] Symbol-table operations: insert, delete, search.
  • [04:35] Direct-address table (direct addressing).
  • [09:45] Hashing.
  • [14:30] Resolving hash function collisions by chaining.
  • [17:05] Worst-case analysis of chaining.
  • [19:15] Average-case analysis of chaining.
  • [29:30] Choosing a hash function.
  • [30:55] Division method hash function.
  • [39:05] Multiplication method hash function.
  • [46:30] Multiplication method explained with a modular wheel.
  • [50:12] Resolving hash function collisions by open addressing.
  • [59:00] Linear probing strategy.
  • [01:01:30] Double hashing probing strategy.
  • [01:04:20] Average-case analysis of open addressing.

Lecture seven notes:

MIT Algorithms Lecture 7 Notes Thumbnail. Page 1 of 2.
Lecture 7, page 1 of 2.

MIT Algorithms Lecture 7 Notes Thumbnail. Page 2 of 2.
Lecture 7, page 2 of 2.

Lecture 8: Hashing II

Lecture eight starts with addressing a weakness of hashing -- for any choice of hash function, there exists a set of keys that all hash to the same value. This weakness can lead to denial of service attacks on the application using hashing.

The idea of addressing this problem is to choose a hash function at random! This is called universal hashing.

The lecture then moves on to a mathematically rigorous definition of universal hashing and explains one of many ways to construct a universal hash function.
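
For a concrete feel, here is a small Python sketch of one well-known construction, the "dot product" family: a key is viewed as r+1 digits base m (with m prime) and dotted with a randomly chosen vector a modulo m. The parameters m and r below are just illustrative:

import random

def make_universal_hash(m, r):
    # Pick a random member of the "dot product" universal family:
    # view a key k as r+1 digits base m (m prime), k = <k_0, ..., k_r>,
    # and return h_a(k) = sum(a_i * k_i) mod m for a random vector a.
    a = [random.randrange(m) for _ in range(r + 1)]

    def h(k):
        total = 0
        for a_i in a:
            total = (total + a_i * (k % m)) % m   # a_i times the next digit of k
            k //= m
        return total

    return h

# choosing the hash function at random (at program start, say) defeats an
# adversary who precomputes a set of keys that all collide
h = make_universal_hash(m=701, r=3)       # handles keys up to 701**4 - 1
print(h(123456), h(654321))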

The other half of the lecture is devoted to perfect hashing. Perfect hashing solves the problem of constructing a static hash table (such as a hash table stored on a CD) so that searches take O(1) time guaranteed (worst case). The key idea in creating such a hash table is to use 2-level universal hashing, so that no collisions occur at level 2.
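
Below is a rough Python sketch of that two-level (FKS-style) scheme. It assumes distinct non-negative integer keys and uses the classic ((a·k + b) mod p) mod m universal family, which may differ in details from the exact construction shown in the lecture:

import random

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def next_prime(n):
    while not is_prime(n):
        n += 1
    return n

def random_hash(m, p):
    # a random member of the universal family h(k) = ((a*k + b) mod p) mod m
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda k: ((a * k + b) % p) % m

def build_perfect_table(keys):
    # Level 1: hash the n keys into n slots, re-picking h1 until the total
    # level-2 space (sum of squared bucket sizes) stays linear in n.
    n = len(keys)
    p = next_prime(max(keys) + 1)          # a prime larger than every key
    while True:
        h1 = random_hash(n, p)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h1(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Level 2: a slot with n_j keys gets a table of size n_j^2; re-pick its
    # hash function until it has no collisions at all.
    level2 = []
    for bucket in buckets:
        size = len(bucket) ** 2
        while True:
            h2 = random_hash(size, p) if size else (lambda k: 0)
            table = [None] * size
            collision = False
            for k in bucket:
                if table[h2(k)] is not None:
                    collision = True
                    break
                table[h2(k)] = k
            if not collision:
                break
        level2.append((h2, table))
    return h1, level2

def perfect_search(h1, level2, k):         # guaranteed O(1): two hashes, one compare
    h2, table = level2[h1(k)]
    return len(table) > 0 and table[h2(k)] == k

keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
h1, level2 = build_perfect_table(keys)
print(perfect_search(h1, level2, 37), perfect_search(h1, level2, 38))  # True False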

Video of lecture eight:

Topics covered in lecture eight:

  • [00:30] Fundamental weakness of hashing.
  • [05:12] Universal hashing.
  • [20:10] Constructing a universal hash function.
  • [49:45] Perfect hashing.
  • [54:30] Example of perfect hashing.
  • [01:06:27] Markov's inequality.
  • [01:14:30] Analysis of storage requirements for perfect hashing.

Lecture eight notes:

MIT Algorithms Lecture 8 Notes Thumbnail. Page 1 of 2.
Lecture 8, page 1 of 2.

MIT Algorithms Lecture 8 Notes Thumbnail. Page 2 of 2.
Lecture 8, page 2 of 2.

Have fun hashing! The next post will be about random binary search trees and red-black trees!

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms").

Linux ifconfig Golfing

A friend of mine recently showed me his version of extracting IP address(es) from ifconfig. The ifconfig tool on Linux is used to list IP addresses and configure network interfaces. It is sometimes necessary to capture your IP address in a variable, for example when writing firewall rules.

I decided to show off and golf the extraction of the IP address with four commonly used tools -- Awk, sed, the version of Perl everyone has, and the latest Perl 5.10.

His original version was pretty ugly:

$ ifconfig | perl -ple 'print $_ if /inet addr/ and $_ =~ s/.*inet addr:((?:\d+\.){3}\d+).*/$1/g  ;$_=""' | grep -v ^\s*$

My own version on my firewall was:

$ ifconfig | grep inet | awk -F: '{ print $2 }' | awk '{ print $1 }'

This looks nicer, but I was needlessly using grep and calling awk twice.

Golfing in Perl

My first attempt was to do it with grep and Perl.

$ ifconfig | grep inet | perl -ple '($_) = /addr:([^ ]+)/'
192.168.1.1
127.0.0.1

In a split second I realized that Perl does grep itself:

$ ifconfig | perl -nle '/addr:([^ ]+)/ and print $1'
192.168.1.1
127.0.0.1

Then I noticed that 'dr:' matches the same lines as 'addr:'; also 'and' can be replaced with '&&':

$ ifconfig | perl -nle '/dr:([^ ]+)/ && print $1'
192.168.1.1
127.0.0.1

The regular expression '([^ ]+)' can be replaced with '(\S+)', which matches non-whitespace characters:

$ ifconfig | perl -nle '/dr:(\S+)/ && print $1'
192.168.1.1
127.0.0.1

Cutting the whitespace, the final version for Perl is 37 characters long:

$ ifconfig|perl -nle'/dr:(\S+)/&&print$1'
192.168.1.1
127.0.0.1

Golfing in Perl 5.10

It can be made shorter with Perl 5.10, which introduces the new say function:

$ ifconfig|perl -nE'/dr:(\S+)/&&say$1'
192.168.1.1
127.0.0.1

The result for Perl 5.10 is 34 characters.

Golfing in sed

Then I also tried doing the same with sed:

$ ifconfig | sed -n '/dr:/{;s/.*dr://;s/ .*//;p;}'
192.168.1.1
127.0.0.1

I modified it to strip off everything that was not numbers and dots:

$ ifconfig | sed -n '/dr:/{;s/.*dr:\([0-9.]\+\) .*/\1/;p;}'
192.168.1.1
127.0.0.1

This turned out to be much longer, so I tried getting rid of the backslashes by enabling extended regexes in sed with the -r argument:

$ ifconfig | sed -rn '/dr:/{;s/.*dr:([0-9.]+) .*/\1/;p;}'
192.168.1.1
127.0.0.1

I forgot that I had used the ([^ ]+) regex before. Another friend of mine reminded me of this, shortening the sed script to:

$ ifconfig | sed -rn 's/.*r:([^ ]+) .*/\1/p'
192.168.1.1
127.0.0.1

Dropping the whitespace, the sed version turned out to be 40 characters long.

Golfing in Awk

My final attempt was to optimize my original Awk version:

$ ifconfig | awk '/dr:/{gsub(/.*:/,"",$2);print$2}'
192.168.1.1
127.0.0.1

Cutting the whitespace, the Awk version is 43 characters.

The Winner

The winner in this golf tournament is Perl 5.10 with 34 chars!

Can You Do Better?

Can you golf this even shorter?

PS. My Awk and sed cheat sheets might come in handy!

GNU Awk YouTube Downloader Revisited

Around a year ago I wrote a YouTube video downloader in GNU Awk. As I explained then, I did it to explore the corners of the Awk language.

The key idea in writing this program was to figure out how to do networking in Awk. The original version of Awk does not have any networking capabilities, but it turns out that the GNU version of Awk does! There is even a manual on "TCP/IP Networking With Gnu Awk"!

One of my blog readers, Werner Illchmann, suggested that I add a progress bar and sent me a patch. I improved it a little and here is the result:

$ chmod +x get_youtube_vids.awk
$ ./get_youtube_vids.awk http://www.youtube.com/watch?v=4bQOSRm9YiQ
Parsing YouTube video urls/IDs...
Getting video information for video: 4bQOSRm9YiQ...
Downloading: Premature Optimization is the Root of All Evil!...
Saving video to file 'Premature_Optimization_is_the_Root_of_All_Evil.flv' (size: 227.81kb)...
Done: 85121/233280 bytes (36%, 83.13kb/227.81kb)

Here is the source code of the program:

#!/usr/bin/gawk -f
#
# 2007.07.10 v1.0 - initial release
# 2007.10.21 v1.1 - youtube changed the way it displays vids
# 2008.03.01 v1.2 - youtube changed the way it displays vids
# 2008.08.28 v1.3 - added a progress bar and removed need for --re-interval 
# 2009.08.25 v1.4 - youtube changed the way it displays vids
#
# Peteris Krumins (peter@catonmat.net)
# http://www.catonmat.net -- good coders code, great reuse
#
# Usage: gawk -f get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ...
# or just ./get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1>
#

BEGIN {
    if (ARGC == 1) usage();

    BINMODE = 3

    delete ARGV[0]
    print "Parsing YouTube video urls/IDs..."
    for (i in ARGV) {
        vid_id = parse_url(ARGV[i])
        if (length(vid_id) < 6) { # haven't seen YouTube vids with IDs < 6 chars
            print "Invalid YouTube video specified: " ARGV[i] ", not downloading!"
            continue
        }
        VIDS[i] = vid_id
    }

    for (i in VIDS) {
        print "Getting video information for video: " VIDS[i] "..."
        get_vid_info(VIDS[i], INFO)

        if (INFO["_redirected"]) {
            print "Could not get video info for video: " VIDS[i]
            continue 
        }

        if (!INFO["video_url"]) {
            print "Could not get video_url for video: " VIDS[i]
            print "Please goto my website, and submit a comment with an URL to this video, so that I can fix it!"
            print "Url: http://www.catonmat.net/blog/downloading-youtube-videos-with-gawk/"
            continue
        }
        if ("title" in INFO) {
            print "Downloading: " INFO["title"] "..."
            title = INFO["title"]
        }
        else {
            print "Could not get title for video: " VIDS[i]
            print "Trying to download " VIDS[i] " anyway"
            title = VIDS[i]
        }
        download_video(INFO["video_url"], title)
    }
}

function usage() {
    print "Downloading YouTube Videos with GNU Awk"
    print
    print "Peteris Krumins (peter@catonmat.net)"
    print "http://www.catonmat.net  --  good coders code, great reuse"
    print 
    print "Usage: gawk -f get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ..."
    print "or just ./get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ..."
    exit 1
}

#
# function parse_url
#
# takes a url or an ID of a youtube video and returns just the ID
# for example the url could be the full url: http://www.youtube.com/watch?v=ID
# or it could be www.youtube.com/watch?v=ID
# or just youtube.com/watch?v=ID or http://youtube.com/watch?v=ID
# or just the ID
#
function parse_url(url) {
    gsub(/http:\/\//, "", url)                # get rid of http:// part
    gsub(/www\./,     "", url)                # get rid of www.    part
    gsub(/youtube\.com\/watch\?v=/, "", url)  # get rid of youtube.com... part

    if ((p = index(url, "&")) > 0)      # get rid of &foo=bar&... after the ID
        url = substr(url, 1, p-1)

    return url
}

#
# function get_vid_info
#
# function takes the youtube video ID and gets the title of the video
# and the url to .flv file
#
function get_vid_info(vid_id, INFO,    InetFile, Request, HEADERS, matches, escaped_urls, fmt_urls, fmt) {
    delete INFO
    InetFile = "/inet/tcp/0/www.youtube.com/80"
    Request = "GET /watch?v=" vid_id " HTTP/1.1\r\n"
    Request = Request "Host: www.youtube.com\r\n\r\n"

    get_headers(InetFile, Request, HEADERS)
    if ("Location" in HEADERS) {
        INFO["_redirected"] = 1
        close(InetFile)
        return
    }

    # fix this bug:
    # http://www.youtube.com/watch?v=nb1u7wMKywM
    while ((InetFile |& getline) > 0) {
        if (match($0, /"fmt_url_map": "([^"]+)"/, matches)) {
            escaped_urls = url_unescape(matches[1])
            split(escaped_urls, fmt_urls, /,?[0-9]+\|/)
            for (fmt in fmt_urls) {
                if (fmt_urls[fmt] ~ /itag=5/) {
                    # fmt number 5 is the best video
                    INFO["video_url"] = fmt_urls[fmt]
                    close(InetFile)
                    return
                }
            }
            close(InetFile)
            return
        }
        else if (match($0, /<title>YouTube - ([^<]+)</, matches)) {
            # let's try to get the title of the video from the html <title> tag, which
            # is less likely to be a subject of future html design changes
            INFO["title"] = matches[1]
        }
    }
    close(InetFile)
}

#
# function url_unescape
#
# given a string, it url-unescapes it.
# characters such as %20 get converted to their ASCII counterparts.
#
function url_unescape(str,    nmatches, entity, entities, seen, i) {
    nmatches = find_all_matches(str, "%[0-9A-Fa-f][0-9A-Fa-f]", entities)
    for (i = 1; i <= nmatches; i++) {
        entity = entities[i]
        if (!seen[entity]) {
            if (entity == "%26") { # special case for gsub(s, r, t), when r = '&'
                gsub(entity, "\\&", str)
            }
            else {
                gsub(entity, url_entity_unescape(entity), str)
            }
            seen[entity] = 1
        }
    }
    return str
}

#
# function find_all_matches
#
# http://awk.freeshell.org/FindAllMatches
#
function find_all_matches(str, re, arr,    j, a, b) {
    j=0
    a = RSTART; b = RLENGTH   # to avoid unexpected side effects

    while (match(str, re) > 0) {
        arr[++j] = substr(str, RSTART, RLENGTH)
        str = substr(str, RSTART+RLENGTH)
    }
    RSTART = a; RLENGTH = b
    return j
}

#
# function url_entity_unescape
#
# given an url-escaped entity, such as %20, return its ascii counterpart.
#
function url_entity_unescape(entity) {
    sub("%", "", entity)
    return sprintf("%c", strtonum("0x" entity))
}

#
# function download_video
#
# takes the url to video and saves the movie to current directory using
# sanitized video title as the filename
#
function download_video(url, title,    filename, InetFile, Request, Loop, HEADERS, FOO) {
    title = sanitize_title(title)
    filename = create_filename(title)

    parse_location(url, FOO)
    InetFile = FOO["InetFile"]
    Request  = "GET " FOO["Request"] " HTTP/1.1\r\n"
    Request  = Request "Host: " FOO["Host"] "\r\n\r\n"

    Loop = 0 # make sure we do not get caught in Location: loop
    do {     # we can get more than one redirect, follow them all
        get_headers(InetFile, Request, HEADERS)
        if ("Location" in HEADERS) { # we got redirected, let's follow the link
            close(InetFile)
            parse_location(HEADERS["Location"], FOO)
            InetFile = FOO["InetFile"]
            Request  = "GET " FOO["Request"] " HTTP/1.1\r\n"
            Request  = Request "Host: " FOO["Host"] "\r\n\r\n"
            if (InetFile == "") {
                print "Downloading '" title "' failed, couldn't parse Location header!"
                return
            }
        }
        Loop++
    } while (("Location" in HEADERS) && Loop < 5)

    if (Loop == 5) {
        print "Downloading '" title "' failed, got caught in Location loop!"
        return
    }
    
    print "Saving video to file '" filename "' (size: " bytes_to_human(HEADERS["Content-Length"]) ")..."
    save_file(InetFile, filename, HEADERS)
    close(InetFile)
    print "Successfully downloaded '" title "'!"
}

#
# function sanitize_title
#
# sanitizes the video title, by removing ()'s, replacing spaces with _, etc.
# 
function sanitize_title(title) {
    gsub(/\(|\)/, "", title)
    gsub(/[^[:alnum:]-]/, "_", title)
    gsub(/_-/, "-", title)
    gsub(/-_/, "-", title)
    gsub(/_$/, "", title)
    gsub(/-$/, "", title)
    gsub(/_{2,}/, "_", title)
    gsub(/-{2,}/, "-", title)
    return title
}

#
# function create_filename
#
# given a sanitized video title, creates a nonexisting filename
#
function create_filename(title,    filename, i) {
    filename = title ".flv"
    i = 1
    while (file_exists(filename)) {
        filename = title "-" i ".flv"
        i++
    }
    return filename
}

#
# function save_file
#
# given a special network file and filename reads from network until eof
# and saves the read contents into a file named filename
#
function save_file(Inet, filename, HEADERS,    done, cl, perc, hd, hcl) {
    OLD_RS  = RS
    OLD_ORS = ORS

    ORS = ""

    # clear the file
    print "" > filename

    # here we will do a little hackery to write the downloaded data
    # to file chunk by chunk instead of downloading it all to memory
    # and then writing
    #
    # the idea is to use a regex for the record separator
    # everything that gets matched is stored in RT variable
    # which gets written to disk after each match
    #
    # RS = ".{1,512}" # let's read 512 byte records

    RS = "@" # I replaced the 512 block reading with something better.
             # To read blocks I had to force users to specify --re-interval,
             # which made them uncomfortable.
             # I did statistical analysis on YouTube video files and
             # I found that hex value 0x40 appears pretty often (200 bytes or so)!
             #

    cl = HEADERS["Content-Length"]
    hcl = bytes_to_human(cl)
    done = 0
    while ((Inet |& getline) > 0) {
        done += length($0 RT)
        perc = done*100/cl
        hd = bytes_to_human(done)
        printf "Done: %d/%d bytes (%d%%, %s/%s)            \r",
            done, cl, perc, bytes_to_human(done), bytes_to_human(cl)
        print $0 RT >> filename
    }
    printf "Done: %d/%d bytes (%d%%, %s/%s)            \n",
        done, cl, perc, bytes_to_human(done), bytes_to_human(cl)

    RS  = OLD_RS
    ORS = OLD_ORS
}

#
# function get_headers
#
# given a special inet file and the request saves headers in HEADERS array
# special key "_status" can be used to find HTTP response code
# issuing another getline() on inet file would start returning the contents
#
function get_headers(Inet, Request,    HEADERS, matches, OLD_RS) {
    delete HEADERS

    # save global vars
    OLD_RS=RS

    print Request |& Inet

    # get the http status response
    if ((Inet |& getline) > 0) {
        HEADERS["_status"] = $2
    }
    else {
        print "Failed reading from the net. Quitting!"
        exit 1
    }

    RS="\r\n"
    while ((Inet |& getline) > 0) {
        # we could have used FS=": " to split, but I could not think of a good
        # way to handle header values which contain multiple ": ",
        # so I'd rather go with a match
        if (match($0, /([^:]+): (.+)/, matches)) {
            HEADERS[matches[1]] = matches[2]
        }
        else { break }
    }
    RS=OLD_RS
}

#
# function parse_location
#
# given a Location HTTP header value the function constructs a special
# inet file and the request storing them in FOO
#
function parse_location(location, FOO,    matches) {
    # location might look like http://cache.googlevideo.com/get_video?video_id=ID
    if (match(location, /http:\/\/([^\/]+)(\/.+)/, matches)) {
        FOO["InetFile"] = "/inet/tcp/0/" matches[1] "/80"
        FOO["Host"]     = matches[1]
        FOO["Request"]  = matches[2]
    }
    else {
        FOO["InetFile"] = ""
        FOO["Host"]     = ""
        FOO["Request"]  = ""
    }
}

# function bytes_to_human
#
# given bytes, converts them to human readable format like 13.2mb
#
function bytes_to_human(bytes,    MAP, map_idx, bytes_copy) {
    MAP[0] = "b"
    MAP[1] = "kb"
    MAP[2] = "mb"
    MAP[3] = "gb"
    MAP[4] = "tb"
   
    map_idx = 0
    bytes_copy = int(bytes)
    while (bytes_copy > 1024) {
        bytes_copy /= 1024
        map_idx++
    }

    if (map_idx > 4)
        return sprintf("%d bytes", bytes)
    else
        return sprintf("%.02f%s", bytes_copy, MAP[map_idx])
}

#
# function file_exists
#
# given a path to file, returns 1 if the file exists, or 0 if it doesn't
#
function file_exists(file,    foo) {
    if ((getline foo <file) >= 0) {
        close(file)
        return 1
    }
    return 0
}

If you decide to learn the Awk programming language, I suggest that you take a look at the Awk cheat sheet that I have made.

Download GNU Awk YouTube Video Downloader

Download link: gawk youtube video downloader

This article is part of the article series "MIT Introduction to Algorithms."

This is the fourth post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lecture six, which is on the topic of Order Statistics.

The problem of order statistics can be described as follows: given a set of N elements, find the k-th smallest element in it. For example, the minimum of a set is the first order statistic (k = 1) and the maximum is the N-th order statistic (k = N).

Lecture six addresses the problem of selecting the k-th order statistic from a set of N distinct numbers.

The lecture begins by discussing the naive algorithm for finding the k-th order statistic -- sort the given set (or array) A of numbers and return the k-th element, A[k]. The running time of this algorithm is O(n·log(n)) at best.

Professor Erik Demaine presents a randomized divide-and-conquer algorithm for this problem (my second post covers divide-and-conquer algorithms), which runs in expected linear time.

The key idea of this algorithm is to call the Randomized-Partition subroutine (from the previous lecture) on the given array and then recurse only on the elements to the left or to the right of the chosen pivot, depending on which side contains the k-th smallest element.

Here is the pseudo code of this algorithm:

Randomized-Select(A, p, q, k) // finds k-th smallest element in array A[p..q]
  if p == q
    return A[p]
  r = Randomized-Partition(A, p, q)
  n = r - p + 1
  if k == n
    return A[r]
  if k < n
    return Randomized-Select(A, p, r-1, k)
  else
    return Randomized-Select(A, r+1, q, k-n)
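
Here is a direct Python translation of the pseudocode; Randomized-Partition is rendered as a standard Lomuto partition around a randomly chosen pivot (my sketch of that subroutine, since its code isn't shown here):

import random

def randomized_partition(A, p, q):
    """Lomuto partition of A[p..q] around a random pivot; returns the pivot's final index."""
    r = random.randint(p, q)
    A[r], A[q] = A[q], A[r]
    pivot = A[q]
    i = p - 1
    for j in range(p, q):
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[q] = A[q], A[i + 1]
    return i + 1

def randomized_select(A, p, q, k):
    """Return the k-th smallest element of A[p..q] (k is 1-based)."""
    if p == q:
        return A[p]
    r = randomized_partition(A, p, q)
    n = r - p + 1                 # rank of the pivot within A[p..q]
    if k == n:
        return A[r]
    if k < n:
        return randomized_select(A, p, r - 1, k)
    return randomized_select(A, r + 1, q, k - n)

# the example from the lecture: the 7th smallest element of (6, 10, 13, 5, 8, 3, 2, 11)
A = [6, 10, 13, 5, 8, 3, 2, 11]
print(randomized_select(A, 0, len(A) - 1, 7))   # -> 11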

The lecture continues with an intuitive and then a mathematically rigorous analysis of the algorithm. It is concluded that the expected running time of this algorithm is O(n), while the worst case (when Randomized-Partition keeps choosing bad pivots) is O(n²).

In the last minutes of the lecture, a worst-case linear-time order statistics algorithm is presented. This algorithm was invented by five scientists -- M. Blum, R. W. Floyd, V. Pratt, R. Rivest, and R. Tarjan. I found their original publication of this algorithm: Time Bounds for Selection.

You're welcome to watch lecture six:

Lecture six notes:

MIT Algorithms Lecture 6 Notes Thumbnail. Page 1 of 2.
Lecture 6, page 1 of 2.

MIT Algorithms Lecture 6 Notes Thumbnail. Page 2 of 2.
Lecture 6, page 2 of 2.

Topics covered in lecture six:

  • [00:30] The problem of order statistics.
  • [01:50] Naive algorithm for finding k-th smallest element.
  • [04:30] Randomized divide and conquer algorithm.
  • [11:20] Example of algorithm run on array (6, 10, 13, 5, 8, 3, 2, 11) to find 7-th smallest element.
  • [15:30] Intuitive analysis of Randomized-Select algorithm.
  • [20:30] Mathematically rigorous analysis of expected running time.
  • [43:55] Worst-case linear time order statistics algorithm.

Have fun! The next post will be all about hashing!

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms").