GNU Awk YouTube Downloader Revisited

Around a year ago I wrote a YouTube video downloader in GNU Awk. As I explained, I did it to explore the corners of the Awk language.

The key idea in writing this program was to figure out how to do networking in Awk. The original version of Awk does not have any networking capabilities, but it turned out that the GNU version of Awk does! There is even a manual called "TCP/IP Internetworking With gawk"!

One of my blog readers, Werner Illchmann, suggested that I add a progress bar and sent me a patch. I improved it a little and here is the result:

$ chmod +x get_youtube_vids.awk
$ ./get_youtube_vids.awk http://www.youtube.com/watch?v=4bQOSRm9YiQ
Parsing YouTube video urls/IDs...
Getting video information for video: 4bQOSRm9YiQ...
Downloading: Premature Optimization is the Root of All Evil!...
Saving video to file 'Premature_Optimization_is_the_Root_of_All_Evil.flv' (size: 227.81kb)...
Done: 85121/233280 bytes (36%, 83.13kb/227.81kb)

Here is the source code of the program:

#!/usr/bin/gawk -f
#
# 2007.07.10 v1.0 - initial release
# 2007.10.21 v1.1 - youtube changed the way it displays vids
# 2008.03.01 v1.2 - youtube changed the way it displays vids
# 2008.08.28 v1.3 - added a progress bar and removed need for --re-interval 
# 2009.08.25 v1.4 - youtube changed the way it displays vids
#
# Peteris Krumins (peter@catonmat.net)
# http://www.catonmat.net -- good coders code, great reuse
#
# Usage: gawk -f get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ...
# or just ./get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1>
#

BEGIN {
    if (ARGC == 1) usage();

    BINMODE = 3

    delete ARGV[0]
    print "Parsing YouTube video urls/IDs..."
    for (i in ARGV) {
        vid_id = parse_url(ARGV[i])
        if (length(vid_id) < 6) { # haven't seen youtube vids with IDs < 6 chars
            print "Invalid YouTube video specified: " ARGV[i] ", not downloading!"
            continue
        }
        VIDS[i] = vid_id
    }

    for (i in VIDS) {
        print "Getting video information for video: " VIDS[i] "..."
        get_vid_info(VIDS[i], INFO)

        if (INFO["_redirected"]) {
            print "Could not get video info for video: " VIDS[i]
            continue 
        }

        if (!INFO["video_url"]) {
            print "Could not get video_url for video: " VIDS[i]
            print "Please go to my website and submit a comment with a URL to this video, so that I can fix it!"
            print "Url: http://www.catonmat.net/blog/downloading-youtube-videos-with-gawk/"
            continue
        }
        if ("title" in INFO) {
            print "Downloading: " INFO["title"] "..."
            title = INFO["title"]
        }
        else {
            print "Could not get title for video: " VIDS[i]
            print "Trying to download " VIDS[i] " anyway"
            title = VIDS[i]
        }
        download_video(INFO["video_url"], title)
    }
}

function usage() {
    print "Downloading YouTube Videos with GNU Awk"
    print
    print "Peteris Krumins (peter@catonmat.net)"
    print "http://www.catonmat.net  --  good coders code, great reuse"
    print 
    print "Usage: gawk -f get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ..."
    print "or just ./get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ..."
    exit 1
}

#
# function parse_url
#
# takes a url or an ID of a youtube video and returns just the ID
# for example the url could be the full url: http://www.youtube.com/watch?v=ID
# or it could be www.youtube.com/watch?v=ID
# or just youtube.com/watch?v=ID or http://youtube.com/watch?v=ID
# or just the ID
#
function parse_url(url) {
    gsub(/http:\/\//, "", url)                # get rid of http:// part
    gsub(/www\./,     "", url)                # get rid of www.    part
    gsub(/youtube\.com\/watch\?v=/, "", url)  # get rid of youtube.com... part

    if ((p = index(url, "&")) > 0)      # get rid of &foo=bar&... after the ID
        url = substr(url, 1, p-1)

    return url
}

#
# function get_vid_info
#
# function takes the youtube video ID and gets the title of the video
# and the url to .flv file
#
function get_vid_info(vid_id, INFO,    InetFile, Request, HEADERS, matches, escaped_urls, fmt_urls, fmt) {
    delete INFO
    InetFile = "/inet/tcp/0/www.youtube.com/80"
    Request = "GET /watch?v=" vid_id " HTTP/1.1\r\n"
    Request = Request "Host: www.youtube.com\r\n\r\n"

    get_headers(InetFile, Request, HEADERS)
    if ("Location" in HEADERS) {
        INFO["_redirected"] = 1
        close(InetFile)
        return
    }

    # fix this bug:
    # http://www.youtube.com/watch?v=nb1u7wMKywM
    while ((InetFile |& getline) > 0) {
        if (match($0, /"fmt_url_map": "([^"]+)"/, matches)) {
            escaped_urls = url_unescape(matches[1])
            split(escaped_urls, fmt_urls, /,?[0-9]+\|/)
            for (fmt in fmt_urls) {
                if (fmt_urls[fmt] ~ /itag=5/) {
                    # fmt number 5 is the best video
                    INFO["video_url"] = fmt_urls[fmt]
                    close(InetFile)
                    return
                }
            }
            close(InetFile)
            return
        }
        else if (match($0, /<title>YouTube - ([^<]+)</, matches)) {
            # let's try to get the title of the video from the html tag, which is
            # less likely to be subject to future html design changes
            INFO["title"] = matches[1]
        }
    }
    close(InetFile)
}

#
# function url_unescape
#
# given a string, it url-unescapes it.
# characters such as %20 get converted to their ascii counterparts.
#
function url_unescape(str,    nmatches, entity, entities, seen, i) {
    nmatches = find_all_matches(str, "%[0-9A-Fa-f][0-9A-Fa-f]", entities)
    for (i = 1; i <= nmatches; i++) {
        entity = entities[i]
        if (!seen[entity]) {
            if (entity == "%26") { # special case for gsub(s, r, t), when r = '&'
                gsub(entity, "\\&", str)
            }
            else {
                gsub(entity, url_entity_unescape(entity), str)
            }
            seen[entity] = 1
        }
    }
    return str
}

#
# function find_all_matches
#
# http://awk.freeshell.org/FindAllMatches
#
function find_all_matches(str, re, arr,    j, a, b) {
    j=0
    a = RSTART; b = RLENGTH   # to avoid unexpected side effects

    while (match(str, re) > 0) {
        arr[++j] = substr(str, RSTART, RLENGTH)
        str = substr(str, RSTART+RLENGTH)
    }
    RSTART = a; RLENGTH = b
    return j
}

#
# function url_entity_unescape
#
# given an url-escaped entity, such as %20, return its ascii counterpart.
#
function url_entity_unescape(entity) {
    sub("%", "", entity)
    return sprintf("%c", strtonum("0x" entity))
}

#
# function download_video
#
# takes the url to video and saves the movie to current directory using
# sanitized video title as filename
#
function download_video(url, title,    filename, InetFile, Request, Loop, HEADERS, FOO) {
    title = sanitize_title(title)
    filename = create_filename(title)

    parse_location(url, FOO)
    InetFile = FOO["InetFile"]
    Request  = "GET " FOO["Request"] " HTTP/1.1\r\n"
    Request  = Request "Host: " FOO["Host"] "\r\n\r\n"

    Loop = 0 # make sure we do not get caught in Location: loop
    do {     # we can get more than one redirect, follow them all
        get_headers(InetFile, Request, HEADERS)
        if ("Location" in HEADERS) { # we got redirected, let's follow the link
            close(InetFile)
            parse_location(HEADERS["Location"], FOO)
            InetFile = FOO["InetFile"]
            Request  = "GET " FOO["Request"] " HTTP/1.1\r\n"
            Request  = Request "Host: " FOO["Host"] "\r\n\r\n"
            if (InetFile == "") {
                print "Downloading '" title "' failed, couldn't parse Location header!"
                return
            }
        }
        Loop++
    } while (("Location" in HEADERS) && Loop < 5)

    if (Loop == 5) {
        print "Downloading '" title "' failed, got caught in Location loop!"
        return
    }
    
    print "Saving video to file '" filename "' (size: " bytes_to_human(HEADERS["Content-Length"]) ")..."
    save_file(InetFile, filename, HEADERS)
    close(InetFile)
    print "Successfully downloaded '" title "'!"
}

#
# function sanitize_title
#
# sanitizes the video title, by removing ()'s, replacing spaces with _, etc.
# 
function sanitize_title(title) {
    gsub(/\(|\)/, "", title)
    gsub(/[^[:alnum:]-]/, "_", title)
    gsub(/_-/, "-", title)
    gsub(/-_/, "-", title)
    gsub(/_$/, "", title)
    gsub(/-$/, "", title)
    gsub(/_{2,}/, "_", title)
    gsub(/-{2,}/, "-", title)
    return title
}

#
# function create_filename
#
# given a sanitized video title, creates a nonexisting filename
#
function create_filename(title,    filename, i) {
    filename = title ".flv"
    i = 1
    while (file_exists(filename)) {
        filename = title "-" i ".flv"
        i++
    }
    return filename
}

#
# function save_file
#
# given a special network file and filename reads from network until eof
# and saves the read contents into a file named filename
#
function save_file(Inet, filename, HEADERS,    done, cl, perc, hd, hcl, OLD_RS, OLD_ORS) {
    OLD_RS  = RS
    OLD_ORS = ORS

    ORS = ""

    # clear the file
    print "" > filename

    # here we will do a little hackery to write the downloaded data
    # to file chunk by chunk instead of downloading it all to memory
    # and then writing
    #
    # the idea is to use a regex for the record separator
    # everything that gets matched is stored in RT variable
    # which gets written to disk after each match
    #
    # RS = ".{1,512}" # let's read 512 byte records

    RS = "@" # I replaced the 512 block reading with something better.
             # To read blocks I had to force users to specify --re-interval,
             # which made them uncomfortable.
             # I did statistical analysis on YouTube video files and
             # I found that hex value 0x40 appears pretty often (200 bytes or so)!
             #

    cl = HEADERS["Content-Length"]
    hcl = bytes_to_human(cl)
    done = 0
    while ((Inet |& getline) > 0) {
        done += length($0 RT)
        perc = done*100/cl
        hd = bytes_to_human(done)
        printf "Done: %d/%d bytes (%d%%, %s/%s)            \r",
            done, cl, perc, bytes_to_human(done), bytes_to_human(cl)
        print $0 RT >> filename
    }
    printf "Done: %d/%d bytes (%d%%, %s/%s)            \n",
        done, cl, perc, bytes_to_human(done), bytes_to_human(cl)

    RS  = OLD_RS
    ORS = OLD_ORS
}

#
# function get_headers
#
# given a special inet file and the request saves headers in HEADERS array
# special key "_status" can be used to find HTTP response code
# issuing another getline() on inet file would start returning the contents
#
function get_headers(Inet, Request,    HEADERS, matches, OLD_RS) {
    delete HEADERS

    # save global vars
    OLD_RS=RS

    print Request |& Inet

    # get the http status response
    if ((Inet |& getline) > 0) {
        HEADERS["_status"] = $2
    }
    else {
        print "Failed reading from the net. Quitting!"
        exit 1
    }

    RS="\r\n"
    while ((Inet |& getline) > 0) {
        # we could have used FS=": " to split, but i could not think of a good
        # way to handle header values which contain multiple ": "
        # so i better go with a match
        if (match($0, /([^:]+): (.+)/, matches)) {
            HEADERS[matches[1]] = matches[2]
        }
        else { break }
    }
    RS=OLD_RS
}

#
# function parse_location
#
# given a Location HTTP header value the function constructs a special
# inet file and the request storing them in FOO
#
function parse_location(location, FOO,    matches) {
    # location might look like http://cache.googlevideo.com/get_video?video_id=ID
    if (match(location, /http:\/\/([^\/]+)(\/.+)/, matches)) {
        FOO["InetFile"] = "/inet/tcp/0/" matches[1] "/80"
        FOO["Host"]     = matches[1]
        FOO["Request"]  = matches[2]
    }
    else {
        FOO["InetFile"] = ""
        FOO["Host"]     = ""
        FOO["Request"]  = ""
    }
}

# function bytes_to_human
#
# given bytes, converts them to human readable format like 13.2mb
#
function bytes_to_human(bytes,    MAP, map_idx, bytes_copy) {
    MAP[0] = "b"
    MAP[1] = "kb"
    MAP[2] = "mb"
    MAP[3] = "gb"
    MAP[4] = "tb"
   
    map_idx = 0
    bytes_copy = int(bytes)
    while (bytes_copy > 1024) {
        bytes_copy /= 1024
        map_idx++
    }

    if (map_idx > 4)
        return sprintf("%d bytes", bytes)
    else
        return sprintf("%.02f%s", bytes_copy, MAP[map_idx])
}

#
# function file_exists
#
# given a path to file, returns 1 if the file exists, or 0 if it doesn't
#
function file_exists(file,    foo) {
    if ((getline foo <file) >= 0) {
        close(file)
        return 1
    }
    return 0
}

If you decide to learn the Awk programming language, I suggest that you take a look at the Awk Cheat Sheet that I have made.


This article is part of the article series "MIT Introduction to Algorithms."

This is the fourth post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lecture six, which is on the topic of Order Statistics.

The problem of order statistics can be described as follows: given a set of N elements, find the k-th smallest element in it. For example, the minimum of a set is the first order statistic (k = 1) and the maximum is the N-th order statistic (k = N).

Lecture six addresses the problem of selecting the k-th order statistic from a set of N distinct numbers.

The lecture begins by discussing the naive algorithm for finding the k-th order statistic -- sort the given set (or array) A of numbers and return the k-th element A[k]. The running time of such an algorithm is at best O(n·log(n)).

Erik Demaine (professor) presents a randomized divide and conquer algorithm for this problem (my second post covers divide and conquer algorithms), which runs in expected linear time.

The key idea of this algorithm is to call the Randomized-Partition subroutine (from the previous lecture) on the given array and then recurse only on the side of the chosen pivot element that contains the k-th smallest element.

Here is the pseudo code of this algorithm:

Randomized-Select(A, p, q, k) // finds k-th smallest element in array A[p..q]
  if p == q
    return A[p]
  r = Randomized-Partition(A, p, q)
  n = r - p + 1
  if k == n
    return A[r]
  if k < n
    return Randomized-Select(A, p, r-1, k)
  else
    return Randomized-Select(A, r+1, q, k-n)
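The pseudocode above translates almost line for line into Python. The following is only a sketch: the function names and the use of `random.randint` for pivot selection are my own choices, and the array indices are 0-based while k stays 1-based.

```python
import random

def randomized_partition(a, p, q):
    """Partition a[p..q] in place around a random pivot; return the pivot's final index."""
    r = random.randint(p, q)          # both ends inclusive
    a[p], a[r] = a[r], a[p]
    x = a[p]
    i = p
    for j in range(p + 1, q + 1):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[p], a[i] = a[i], a[p]
    return i

def randomized_select(a, p, q, k):
    """Return the k-th smallest element (k is 1-based) of a[p..q]."""
    if p == q:
        return a[p]
    r = randomized_partition(a, p, q)
    n = r - p + 1                     # rank of the pivot within a[p..q]
    if k == n:
        return a[r]
    if k < n:
        return randomized_select(a, p, r - 1, k)
    return randomized_select(a, r + 1, q, k - n)

data = [6, 10, 13, 5, 8, 3, 2, 11]
print(randomized_select(data[:], 0, len(data) - 1, 7))  # prints 11
```

The pivot is random, but the answer is not: whichever pivots get chosen, the 7-th smallest element of this array is 11.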

The lecture continues with an intuitive and then a mathematically rigorous analysis of the algorithm. It concludes that the expected running time of this algorithm is O(n) and that the worst case (when the Randomized-Partition algorithm chooses bad pivots) is O(n²).

In the last minutes of the lecture a worst-case linear time order statistics algorithm is presented. This algorithm was invented by five scientists: M. Blum, R. W. Floyd, V. Pratt, R. Rivest and R. Tarjan. I found the original publication of this algorithm by these gentlemen: Time Bounds for Selection.
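The post gives no pseudocode for the worst-case linear time algorithm, so here is a rough sketch of the commonly described "median of medians" scheme with groups of five. It is my own illustration, not the lecture's code: it builds new lists instead of partitioning in place, and the name `select` is hypothetical.

```python
def select(items, k):
    """Return the k-th smallest element (k is 1-based) of a non-empty list."""
    if len(items) <= 5:
        return sorted(items)[k - 1]

    # 1. Split into groups of five and take the median of each group.
    groups = [items[i:i + 5] for i in range(0, len(items), 5)]
    medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]

    # 2. Recursively pick the median of those medians as the pivot.
    pivot = select(medians, (len(medians) + 1) // 2)

    # 3. Partition around the pivot and recurse into one side only.
    lower = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    upper = [x for x in items if x > pivot]
    if k <= len(lower):
        return select(lower, k)
    if k <= len(lower) + len(equal):
        return pivot
    return select(upper, k - len(lower) - len(equal))

print(select([6, 10, 13, 5, 8, 3, 2, 11], 7))  # prints 11
```

Choosing the pivot this way guarantees that a constant fraction of the elements is discarded at every step, which is what removes the bad-pivot worst case of the randomized version.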

You're welcome to watch lecture six:

Lecture six notes:

Lecture 6, page 1 of 2.

Lecture 6, page 2 of 2.

Topics covered in lecture six:

  • [00:30] The problem of order statistics.
  • [01:50] Naive algorithm for finding k-th smallest element.
  • [04:30] Randomized divide and conquer algorithm.
  • [11:20] Example of algorithm run on array (6, 10, 13, 5, 8, 3, 2, 11) to find 7-th smallest element.
  • [15:30] Intuitive analysis of Randomized-Select algorithm.
  • [20:30] Mathematically rigorous analysis of expected running time.
  • [43:55] Worst-case linear time order statistics algorithm.

Have fun! The next post will be all about hashing!

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms").

This article is part of the article series "MIT Introduction to Algorithms."

This is the third post in an article series about MIT's lecture course "Introduction to Algorithms." In this post I will review lectures four and five, which are on the topic of sorting.

The previous post covered a lecture on "Divide and Conquer" algorithm design technique and its applications.

Lecture four is devoted entirely to a single sorting algorithm which uses this technique. The algorithm I am talking about is the "Quicksort" algorithm. The quicksort algorithm was invented by Charles Hoare in 1962 and it is the most widely used sorting algorithm in practice.

I wrote about quicksort before in the "Three Beautiful Quicksorts" post, where its running time was analyzed experimentally. This lecture does it theoretically.

Lecture five talks about theoretical running time limits of sorting using comparisons and then discusses two linear time sorting algorithms -- Counting sort and Radix sort.

Lecture 4: Quicksort

The lecture starts by giving the divide and conquer description of the algorithm:

  • Divide: Partition the array to be sorted into two subarrays around pivot x, such that elements in the lower subarray <= x, and elements in the upper subarray >= x.
  • Conquer: Recursively sort lower and upper subarrays using quicksort.
  • Combine: Since the subarrays are sorted in place, no work is needed to combine them. The entire array is now sorted!

The main algorithm can be written in a few lines of code:

Quicksort(A, p, r) // sorts list A[p..r]
  if p < r
    q = Partition(A, p, r)
    Quicksort(A, p, q-1)
    Quicksort(A, q+1, r)

The key in this algorithm is the Partition subroutine:

Partition(A, p, q)
  // partitions elements in array A[p..q] in-place around element x = A[p],
  // so that A[p..i] <= x and A[i] == x and A[i+1..q] >= x
  x = A[p]
  i = p
  for j = p+1 to q
    if A[j] <= x
      i = i+1
      swap(A[i], A[j])
  swap(A[p], A[i])
  return i
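For readers who want to run it, here is a direct Python transcription of the two routines above. It is a sketch with 0-based indices; the function names simply mirror the pseudocode.

```python
def partition(a, p, q):
    """Partition a[p..q] in place around x = a[p]; return the pivot's final index."""
    x = a[p]
    i = p
    for j in range(p + 1, q + 1):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[p], a[i] = a[i], a[p]
    return i

def quicksort(a, p, r):
    """Sort a[p..r] in place."""
    if p < r:
        q = partition(a, p, r)
        quicksort(a, p, q - 1)
        quicksort(a, q + 1, r)

data = [6, 10, 13, 5, 8, 3, 2, 11]
print(partition(data, 0, len(data) - 1), data)
# prints 3 [2, 5, 3, 6, 8, 13, 10, 11]
quicksort(data, 0, len(data) - 1)
print(data)
# prints [2, 3, 5, 6, 8, 10, 11, 13]
```

The first call shows a single partitioning step on the example array used in the lecture: the pivot 6 ends up at index 3 with smaller elements to its left and larger ones to its right.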

The lecture then proceeds with the analysis of Quicksort's worst-case running time. It is concluded that if the array to be sorted is already sorted or already reverse sorted, the running time is O(n²), which is no better than Insertion sort (seen in lecture one)! What happens in the worst case is that all the elements get partitioned to one side of the chosen pivot. For example, given an array (1, 2, 3, 4), the partition algorithm chooses 1 as the pivot, then all the elements stay where they were, and no partitioning actually happens. To overcome this problem, the Randomized Quicksort algorithm is introduced.

The main idea of randomized quicksort hides in the Partition subroutine. Instead of partitioning around the first element, it partitions around a random element in the array!

Nice things about randomized quicksort are:

  • Its running time is independent of initial element order.
  • No assumptions need to be made about the statistical distribution of input.
  • No specific input elicits the worst case behavior.
  • The worst case is determined only by the output of a random-number generator.

Randomized-Partition(A, p, q)
  swap(A[p], A[rand(p,q)]) // the only thing changed in the original Partition subroutine!
  x = A[p]
  i = p
  for j = p+1 to q
    if A[j] <= x
      i = i+1
      swap(A[i], A[j])
  swap(A[p], A[i])
  return i
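A Python sketch of the randomized variant follows. I am assuming that rand(p, q) in the pseudocode returns an index between p and q with both ends included, which is what `random.randint` does; everything else is a straight transcription.

```python
import random

def randomized_partition(a, p, q):
    """Same as Partition, except that a random element is first swapped into a[p]."""
    r = random.randint(p, q)          # assumed meaning of rand(p, q): inclusive range
    a[p], a[r] = a[r], a[p]
    x = a[p]
    i = p
    for j in range(p + 1, q + 1):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[p], a[i] = a[i], a[p]
    return i

def randomized_quicksort(a, p, r):
    """Sort a[p..r] in place using random pivots."""
    if p < r:
        q = randomized_partition(a, p, r)
        randomized_quicksort(a, p, q - 1)
        randomized_quicksort(a, q + 1, r)

already_sorted = list(range(2000))    # the bad case for the first-element pivot
randomized_quicksort(already_sorted, 0, len(already_sorted) - 1)
print(already_sorted == list(range(2000)))  # prints True
```

With the first element as the pivot, this sorted input would cause one level of recursion per element. With random pivots it is no worse than any other input of the same size.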

The rest of the lecture is very math-heavy and uses indicator random variables to conclude that the expected running time of randomized quicksort is O(n·lg(n)).

In practice quicksort is 3 or more times faster than merge sort.

Please see the Three Beautiful Quicksorts post for more information about an industrial-strength version of quicksort.

You're welcome to watch lecture four:

Topics covered in lecture four:

  • [00:35] Introduction to quicksort.
  • [02:30] Divide and conquer approach to quicksort.
  • [05:20] Key idea - linear time (Θ(n)) partitioning subroutine.
  • [07:50] Structure of partitioning algorithm.
  • [11:40] Example of partition algorithm run on array (6, 10, 13, 5, 8, 3, 2, 11).
  • [16:00] Pseudocode of quicksort algorithm.
  • [19:00] Analysis of quicksort algorithm.
  • [20:25] Worst case analysis of quicksort.
  • [24:15] Recursion tree for worst case.
  • [28:55] Best case analysis of quicksort, if partition splits elements in half.
  • [28:55] Analysis of quicksort, if partition splits elements in a proportion 1/10 : 9/10.
  • [33:33] Recursion tree of this analysis.
  • [04:30] Analysis of quicksort, if partition alternates between lucky, unlucky, lucky, unlucky, ...
  • [46:50] Randomized quicksort.
  • [51:10] Analysis of randomized quicksort.

Lecture four notes:

Lecture 4, page 1 of 2.

Lecture 4, page 2 of 2.

Lecture 5: Lower Sorting Bounds and Linear Sorting

The lecture starts with a question -- How fast can we sort? Erik Demaine (professor) answers and says that it depends on the model of what you can do with the elements.

The previous lectures introduced several algorithms that can sort n numbers in O(n·lg(n)) time. Merge sort achieves this upper bound in the worst case; quicksort achieves it on average. These algorithms share an interesting property: the sorted order they determine is based only on comparisons between the input elements. Such sorting algorithms are called Comparison Sorts, which is a model for sorting.

It turns out that any comparison sort algorithm can be translated into something that is called a Decision Tree. A decision tree is a full binary tree that represents the comparisons between elements that are performed by a particular sorting algorithm operating on an input of a given size.

Erik uses decision trees to derive the lower bound for the running time of comparison-based sorting algorithms. The result is that every comparison-based sort needs Ω(n·lg(n)) comparisons in the worst case, so none can do asymptotically better than n·lg(n).
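The argument behind this bound is short. A decision tree that sorts n distinct elements needs at least n! leaves, one per possible ordering, and a binary tree of height h has at most 2^h leaves. So h >= lg(n!), which grows like n·lg(n). The small sketch below computes that minimum height for a few values of n; the function name `min_comparisons` is my own.

```python
import math

def min_comparisons(n):
    """Smallest h with 2**h >= n!, a lower bound on worst-case comparisons."""
    leaves = math.factorial(n)
    # For an integer m >= 1, (m - 1).bit_length() equals ceil(log2(m)).
    return (leaves - 1).bit_length()

for n in (3, 4, 5, 10):
    print(n, min_comparisons(n))
# prints:
# 3 3
# 4 5
# 5 7
# 10 22
```

For example, no comparison sort can guarantee sorting five elements with fewer than seven comparisons.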

The lecture then steps outside the comparison model and looks at sorting in linear time using no comparisons.

The first linear time algorithm covered in the lecture is Counting Sort. The basic idea of counting sort is to determine, for each input element x, the number of elements less than x. This information can be used to place element x directly into its position in the output array. For example, if there are 17 elements less than x, then x belongs in output position 18.
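Here is a minimal sketch of counting sort in Python. It assumes that the keys are integers in the range 0..k, and the function name and signature are my own. Because Python lists are 0-based, an element with 17 smaller elements lands at index 17 rather than position 18.

```python
def counting_sort(a, k):
    """Return a sorted copy of a, whose elements are integers in 0..k."""
    count = [0] * (k + 1)
    for x in a:
        count[x] += 1
    # Prefix sums: count[v] becomes the number of elements <= v.
    for v in range(1, k + 1):
        count[v] += count[v - 1]
    out = [None] * len(a)
    # Walking the input right to left keeps equal keys in their
    # original relative order, which makes the sort stable.
    for x in reversed(a):
        count[x] -= 1
        out[count[x]] = x
    return out

print(counting_sort([4, 1, 3, 4, 3], 4))  # prints [1, 3, 3, 4, 4]
```

The running time is O(n + k), so the sort is linear as long as the key range k is not much larger than n.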

The second linear time algorithm is Radix Sort, which sorts a list of numbers digit by digit, starting from the least significant digit and using a stable sort at each step.
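A small Python sketch of radix sort is below. To keep it short, it uses per-digit bucket lists as the stable intermediate sort instead of counting sort. That substitution and the function name are my own, and the code assumes non-negative integers.

```python
def radix_sort(a, base=10):
    """Return a sorted copy of a list of non-negative integers."""
    if not a:
        return []
    out = list(a)
    largest = max(a)
    place = 1                          # 1, base, base**2, ...
    while largest // place > 0:
        buckets = [[] for _ in range(base)]
        for x in out:
            # Appending preserves the order from the previous pass (stability).
            buckets[(x // place) % base].append(x)
        out = [x for bucket in buckets for x in bucket]
        place *= base
    return out

print(radix_sort([329, 457, 657, 839, 546, 720, 355]))
# prints [329, 355, 457, 546, 657, 720, 839]
```

Stability is what makes this work: after the pass on a given digit, numbers that agree on that digit are still ordered by their lower digits.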

Erik ends the lecture by analyzing correctness and running time of radix sort.

Video of lecture five:

Topics covered in lecture five:

  • [00:30] How fast can we sort?
  • [02:27] Review of running times of quicksort, heapsort, merge sort and insertion sort.
  • [04:50] Comparison sorting (model for sorting).
  • [06:50] Decision trees.
  • [09:35] General description of decision trees.
  • [14:25] Decision trees model comparison sorts.
  • [20:00] Lower bound on decision tree sorting.
  • [31:35] Sorting in linear time.
  • [32:30] Counting sort.
  • [38:05] Example of counting sort run on an array (4, 1, 3, 4, 3)
  • [50:00] Radix sort.
  • [56:10] Example of radix sort run on array (329, 457, 657, 839, 546, 720, 355).
  • [01:00:30] Correctness of radix sort.
  • [01:04:25] Analysis of radix sort.

Lecture five notes:

Lecture 5, page 1 of 2.

Lecture 5, page 2 of 2.

Have fun sorting! The next post will be about order statistics (given an array, find the n-th smallest element)!

PS. This course is taught from the CLRS book (also called "Introduction to Algorithms").

This article is part of the article series "Musical Geek Friday."

This week on Musical Geek Friday: a song about the lovely cryptographic couple Alice and Bob!

The Alice and Bob song is written by MC Plus+. His real name is Armand Navabi and he's a computer science Ph.D. student at Purdue University. His moniker is a pun on the name C++.

The song is about three different topics in cryptography.

First, the song talks about various archetypes used in cryptography. They are Alice, Bob, Trent, Mallory and Eve. Alice and Bob are the usual parties trying to communicate securely. They both trust Trent, who helps them communicate. Meanwhile Eve tries to eavesdrop on their communication and Mallory tries to modify their messages!

Then, it covers cryptography algorithms -- the insecure 56-bit Data Encryption Standard (DES) algorithm, the secure Advanced Encryption Standard (AES) algorithm, and the not-so-secure Blowfish algorithm.

Lastly, the song touches on the problem of factoring large numbers, for which no polynomial-time algorithm is known.

This song is similar to the first song I ever posted on Musical Geek Friday -- Crypto.

Here it is! The Alice and Bob song:

[audio:http://www.catonmat.net/download/mc_plus_plus-alice_and_bob.mp3]

Download this song: alice and bob.mp3 (musical geek friday #14)
Downloaded: 21238 times

Download lyrics: alice and bob lyrics (musical geek friday #14)
Downloaded: 3397 times

Alice and Bob lyrics:

Alice is sending her message to Bob
Protecting that transmission is Crypto's job
Without the help of our good friend Trent,
It's hard to get that secret message sent
Work tries to deposit the check of your salary
But with no crypto, it'll be changed by Mallory
You think no one will see what it is, you believe?
But you should never forget, there's always an Eve...

[Chorus]
'Cause I'm encrypting s**t like every single day
Sending data across the network in a safe way
Protecting messages to make my pay
If you hack me, you're guilty under DMCA

DES is wrong if you listen to NIST
Double DES ain't no better man, that got dissed
Twofish for AES, that was Schneier's wish
Like a shot from the key, Rijndael made the swish
But Blowfish is still the fastest in the land
And Bruce used his fame to make a few grand
Use ECB, and I'll crack your ciphertext
Try CFB mode to keep everyone perplexed

[Chorus]
'Cause I'm encrypting s**t like every single day
Sending data across the network in a safe way
Protecting messages to make my pay
If you hack me, you're guilty under DMCA

Random numbers ain't easy to produce...
Do it wrong, and your key I'll deduce
RSA, only public cipher in the game
Creating it helped give Rivest his fame
If we could factor large composites in poly time,
We'd have enough money to not have to rhyme
Digesting messages with a hashing function
Using SHA1 or else it won't cause disfunction

[Chorus]
'Cause I'm encrypting s**t like every single day
Sending data across the network in a safe way
Protecting messages to make my pay
If you hack me, you're guilty under DMCA

Password confirmed. Stand by...

Have fun and until next geeky Friday! :)

This article is part of the article series "MIT Introduction to Algorithms."

This is the second post in an article series about MIT's lecture course "Introduction to Algorithms."

I changed my mind a little on how I will be posting the reviews of lectures. In the first post I said that I would be posting reviews of two or three lectures at a time, but I decided to group them by topic instead. This is more logical. Some lectures stand alone. Lecture 3 is like that.

Lecture three is all about a powerful technique in algorithm design called "Divide and Conquer." I won't be able to describe it better than the CLRS book, which says: "Divide and Conquer algorithms break the problem into several subproblems that are similar to the original problem but smaller in size, solve the subproblems recursively, and then combine these solutions to create a solution to the original problem."

There are three steps to applying the Divide and Conquer technique in practice:

  • Divide the problem into one or more subproblems.
  • Conquer subproblems by solving them recursively. If the subproblem sizes are small enough, however, just solve the subproblems in a straightforward manner.
  • Combine the solutions to the subproblems into the solution for the original problem.

An example of Divide and Conquer is the Merge Sort algorithm covered in lecture one:

  • Divide: Divide the n-element sequence to be sorted into two subsequences of n/2 elements each.
  • Conquer: Sort the two subsequences recursively using Merge Sort.
  • Combine: Merge the two sorted subsequences to produce the sorted answer.

The recursion stops when the sequence to be sorted has length 1, in which case there is nothing else to be done, since every sequence of length 1 is already in sorted order!
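The three steps map directly onto code. The following Python sketch is my own illustration rather than code from the lecture. It returns a new sorted list instead of sorting in place.

```python
def merge(left, right):
    """Combine: merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])               # at most one of these two is non-empty
    out.extend(right[j:])
    return out

def merge_sort(seq):
    """Return a sorted copy of seq."""
    if len(seq) <= 1:                  # a sequence of length 0 or 1 is already sorted
        return list(seq)
    mid = len(seq) // 2                # Divide
    return merge(merge_sort(seq[:mid]), merge_sort(seq[mid:]))  # Conquer, then Combine

print(merge_sort([6, 10, 13, 5, 8, 3, 2, 11]))
# prints [2, 3, 5, 6, 8, 10, 11, 13]
```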

Lecture three presents various other algorithms which use the powerful Divide and Conquer technique.

It starts with the running time analysis of the Merge Sort algorithm and shows the general structure of recurrence equations generated by Divide and Conquer algorithms.

Then it moves on to the Binary Search algorithm, which finds an element in a sorted array in time proportional to the logarithm of the array length. In asymptotic notation, its worst-case running time is O(lg(n)).

A naive algorithm which just scanned through all elements of an array would have time complexity of O(n) which is much, much worse!
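As a sketch, binary search can be written recursively so that the divide and conquer structure is visible: compare with the middle element, then recurse into one half only. The signature below, including the `lo` and `hi` parameters and the -1 return value for a missing element, is my own choice.

```python
def binary_search(a, x, lo=0, hi=None):
    """Return an index of x in the sorted list a, or -1 if x is not present."""
    if hi is None:
        hi = len(a) - 1
    if lo > hi:                        # empty range: x is not in the list
        return -1
    mid = (lo + hi) // 2               # Divide: look at the middle element
    if a[mid] == x:
        return mid
    if a[mid] < x:                     # Conquer: recurse into one half only
        return binary_search(a, x, mid + 1, hi)
    return binary_search(a, x, lo, mid - 1)

numbers = [2, 3, 5, 6, 8, 10, 11, 13]
print(binary_search(numbers, 10))  # prints 5
print(binary_search(numbers, 7))   # prints -1
```

Each call halves the remaining range, which is where the logarithmic running time comes from.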

The next algorithm presented solves the following problem: given a number x and an integer n, compute x to the power n.

A naive algorithm would create a loop from 1 to n, and multiply the x with itself:

Naive-Power(x, n):
  result = 1
  for i = 1 to n:
    result = x * result
  return result

This leads to running time of O(n).

A much more clever approach is to use an algorithm called Recursive Powering. The idea is to express x^n as:

  • x^(n/2) · x^(n/2), if n is even, or
  • x^((n-1)/2) · x^((n-1)/2) · x, if n is odd.

Here is the pseudo-code of this algorithm:

Recursive-Power(x, n):
  if n == 1
    return x
  if n is even
    y = Recursive-Power(x, n/2)
    return y*y
  else
    y = Recursive-Power(x, (n-1)/2)
    return y*y*x

Doing asymptotic analysis of this algorithm leads to O(lg(n)) running time, which is a great improvement over O(n).
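Both powering algorithms are easy to try out in Python. This is a sketch that follows the pseudocode closely. The n == 0 base case is my own addition so that the function also handles a zero exponent, and it assumes n is a non-negative integer.

```python
def naive_power(x, n):
    """n multiplications: O(n)."""
    result = 1
    for _ in range(n):
        result = x * result
    return result

def recursive_power(x, n):
    """About lg(n) levels of recursion: O(lg(n))."""
    if n == 0:                         # extra base case, not in the pseudocode
        return 1
    if n == 1:
        return x
    if n % 2 == 0:
        y = recursive_power(x, n // 2)
        return y * y
    y = recursive_power(x, (n - 1) // 2)
    return y * y * x

print(naive_power(3, 13), recursive_power(3, 13))  # prints 1594323 1594323
```

The saving comes from computing the half power once and reusing it, so each level of recursion costs one or two multiplications.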

The lecture continues with four algorithms on how to compute Fibonacci Numbers! You should definitely watch that! Man, it's four algorithms! It's at 17:49 in the lecture!

I won't give the pseudo code here for these ones, but they are naive recursive algorithm, bottom up algorithm, naive recursive squaring and recursive squaring!
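For the curious, here is a sketch of two of the four: the bottom-up algorithm and recursive squaring. The latter relies on the identity that the n-th power of the matrix [[1, 1], [1, 0]] is [[F(n+1), F(n)], [F(n), F(n-1)]], so it uses only integer arithmetic. The helper names are my own, and I take F(0) = 0 and F(1) = 1.

```python
def fib_bottom_up(n):
    """Linear time: build F(n) up from F(0) and F(1)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def mat_mul(X, Y):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [X[0][0] * Y[0][0] + X[0][1] * Y[1][0], X[0][0] * Y[0][1] + X[0][1] * Y[1][1]],
        [X[1][0] * Y[0][0] + X[1][1] * Y[1][0], X[1][0] * Y[0][1] + X[1][1] * Y[1][1]],
    ]

def mat_pow(M, n):
    """Recursive powering, applied to a 2x2 matrix instead of a number."""
    if n == 0:
        return [[1, 0], [0, 1]]        # identity matrix
    half = mat_pow(M, n // 2)
    square = mat_mul(half, half)
    return mat_mul(square, M) if n % 2 else square

def fib_squaring(n):
    """O(lg(n)) matrix multiplications."""
    return mat_pow([[1, 1], [1, 0]], n)[0][1]

print(fib_bottom_up(10), fib_squaring(10))  # prints 55 55
```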

The next two topics are Matrix Multiplication the naive way, giving O(n³) running time, and the improved Strassen's Algorithm, which reduces the number of multiplications, yielding a running time of approximately O(n^2.81).
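To make Strassen's trick concrete, here is a sketch in plain Python. It assumes square matrices whose size is a power of two, stored as nested lists. The seven products follow the form commonly given in textbooks, and the helper names `add`, `sub` and `split` are my own. Seven half-size products instead of eight is what brings the exponent down from 3 to lg(7), which is about 2.81.

```python
def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(M):
    """Split a square matrix of even size into its four quadrants."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])

def strassen(A, B):
    """Multiply two n x n matrices (n a power of two) using 7 recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    a, b, c, d = split(A)
    e, f, g, h = split(B)
    p1 = strassen(a, sub(f, h))
    p2 = strassen(add(a, b), h)
    p3 = strassen(add(c, d), e)
    p4 = strassen(d, sub(g, e))
    p5 = strassen(add(a, d), add(e, h))
    p6 = strassen(sub(b, d), add(g, h))
    p7 = strassen(sub(a, c), add(e, f))
    r = add(sub(add(p5, p4), p2), p6)  # top-left quadrant
    s = add(p1, p2)                    # top-right quadrant
    t = add(p3, p4)                    # bottom-left quadrant
    u = sub(sub(add(p5, p1), p3), p7)  # bottom-right quadrant
    top = [x + y for x, y in zip(r, s)]
    bottom = [x + y for x, y in zip(t, u)]
    return top + bottom

print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# prints [[19, 22], [43, 50]]
```

For small matrices the extra additions outweigh the saved multiplication, so practical implementations usually switch to the standard algorithm below some size threshold.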

The last topic covered in this lecture is the VLSI (Very Large Scale Integration) layout problem: given a number of various electronic gates and chips, how to position them on the circuit board to minimize the area they occupy.

Now that I have described the lecture in great detail, you're welcome to watch it:

Topics covered in lecture 3:

  • [01:25] Divide and conquer method.
  • [04:54] Example of Divide and conquer applied to Merge sort.
  • [06:45] Running time analysis of Merge sort.
  • [09:55] Binary search algorithm.
  • [12:00] Analysis of binary search running time.
  • [13:15] Algorithm for powering a number (recursive powering).
  • [17:49] Algorithms for computing Fibonacci numbers (FBs).
  • [19:04] Naive recursive algorithm (exponential time) for computing FBs.
  • [22:45] Bottom-up algorithm for computing FBs.
  • [24:25] Naive recursive squaring algorithm for FBs (doesn't work because of floating point rounding errors).
  • [27:00] Recursive squaring algorithm for FBs.
  • [29:54] Proof by induction of recursive squaring FB algorithm
  • [34:00] Matrix multiplication algorithms.
  • [35:45] Naive (standard) algorithm for multiplying matrices.
  • [37:35] Divide and conquer algorithm for multiplying matrices.
  • [40:00] Running time analysis of divide and conquer matrix multiplication algorithm.
  • [43:09] Strassen's matrix multiplication algorithm.
  • [50:00] Analysis of Strassen's algorithm.
  • [55:25] VLSI layout problem.
  • [57:50] Naive embedding algorithm for VLSI.
  • [01:05:10] Divide and conquer algorithm for laying out VLSI.

Lecture 3 notes:

Lecture 3, page 1 of 2.

Lecture 3, page 2 of 2.

Have fun dividing and conquering! The next post will be about two lectures on sorting algorithms!

PS. The lectures are taught from the CLRS book (also called "Introduction to Algorithms").