In my previous post - downloading youtube videos with a Perl one-liner - I used Perl's special variable $_. This is just one of Perl's many special variables.
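
To give a quick taste of what a special variable is, here is a tiny made-up example (not from that post) that leans on a few of them: $_ is the default variable that while (<>) reads into and that print uses implicitly, $0 holds the program's name, and $. holds the current input line number.

# made-up illustration: number each line of input, prefixed with the program's own name
while (<>) {
    print "$0: line $.: $_";
}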

One day last year I decided to go through all of Perl's special variables carefully, so I could become a better Perl programmer, and to make a cheat sheet that I could always keep on my desk and look things up in when needed.

As I explained in the previous cheat sheet post (the awk cheat sheet), cheat sheets are for learning things better, not for doing things blindly and looking everything up each time. Here is another example of why cheat sheets are helpful - suppose you have programmed for 5 hours and then sit back in your chair to relax for 10 minutes. You could relax by taking a cheat sheet in your hands, just scanning through it and remembering a thing or two.

This cheat sheet contains all of Perl's special variables, their descriptions and examples where possible.

It can be printed nicely on one sheet of paper with two pages per side - two on one side and two on the other. That's how I have printed it.

Available as usual in .doc and .pdf formats.

Download Perl's Special Variable Cheat-Sheet

PDF:
Download link: perl's special variable cheat sheet (.pdf)
Downloaded: 104795 times

Microsoft Word 2000 format (.doc):
Download link: perl's special variable cheat sheet (.doc)
Downloaded: 5941 times


Last time I explained how YouTube videos can be downloaded with the gawk programming language, by getting the YouTube page where the video is displayed and finding out how the flash video player retrieves the FLV (flash video) media file.

This time I'll use the Perl programming language, which is my favorite language at the moment, and write a one-liner which downloads a YouTube video.

Instead of parsing the YouTube video page, let's look at how an embedded YouTube video player on a 3rd-party website gets the video.

Let's go to this cool video and look at the embed html code:

html code for embedded youtube video

For this video it looks like this:

<object width="425" height="350"><param name="movie" value="http://www.youtube.com/v/qg1ckCkm8YI"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/qg1ckCkm8YI" type="application/x-shockwave-flash" wmode="transparent" width="425" height="350"></embed></object>

95% of this code is boring; the only interesting part is this URL:

http://www.youtube.com/v/qg1ckCkm8YI

Let's load this in a browser, and as we do, we get redirected to another URL:

http://www.youtube.com/jp.swf?video_id=qg1ckCkm8YI&eurl=&iurl=http://img.youtube.com/vi/qg1ckCkm8YI/default.jpg&t=OEgsToPDskJCPW5DvMKeM3srnQ5e0LSY

So far we have no information about how the flash player will retrieve the video; the only thing we know is that 'iurl' stands for 'image url' and is the location of the thumbnail image.

Let's sniff the traffic again, this time with an excellent (though commercial) Internet Explorer plugin, 'HttpWatch Professional'.
This plugin displays all the requests the browser makes, whether the traffic is HTTP or HTTPS, and presents them in a nice manner, which makes our job much quicker than using Ethereal.
The Firefox alternative to this tool is the Live HTTP Headers extension, which does basically the same thing as HttpWatch Professional, but its output takes more time to understand.

Here is what we see with HttpWatch Professional when we load the URL in the browser:

output of HttpWatch Professional sniffer

We see that to get the video, the browser first requested:

http://www.youtube.com/get_video?video_id=qg1ckCkm8YI&t=OEgsToPDskJ3bp4DEiMuxUmjx7oumUec&eurl=

then got redirected to:

http://cache.googlevideo.com/get_video?video_id=qg1ckCkm8YI

and then another time to:

http://74.125.13.83/get_video?video_id=qg1ckCkm8YI

This is exactly what we saw in the previous article on downloading videos with gawk!

Now let's write a Perl one-liner that retrieves this video file!

What is a one-liner, you might ask? Well, my definition of a one-liner is a program that you are willing to type out without saving it to disk.

First of all we will need some Perl packages (modules) to ease working with the HTTP protocol. There are two widely used ones available on Perl's module archive (CPAN) - LWP and WWW::Mechanize.

WWW::Mechanize is built on top of LWP, so let's go to a higher level of abstraction and use this module.
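
To get a feel for what WWW::Mechanize does for us, here is a tiny sketch (my own illustration, reusing the video ID from the example above) that fetches the embedded player URL and prints the URL we end up at after the redirect; the 't' parameter we are after sits in its query string:

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# illustration only: Mechanize transparently follows the 302 redirect for us
$mech->get("http://www.youtube.com/v/qg1ckCkm8YI");

# print the URL we ended up at; its query string contains the 't' parameter
print $mech->uri(), "\n";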

The WWW::Mechanize package is not part of Perl's core distribution, so you'll have to install it.
To do that, type

perl -MCPAN -eshell

in your console, and when the CPAN shell appears, type

install WWW::Mechanize

to get the module installed.

If everything goes fine, CPAN will tell you that the module has been installed.

I don't want to go into the details of the Perl language again, nor into the details of the WWW::Mechanize package.

If you want to learn Perl, I recommend this article as a starter, these books and, of course, perldoc. Once you learn the basics, you can quickly pick up the WWW::Mechanize package by reading its documentation, FAQ and trying the examples.

Now finally let's write the one-liner. So what do we have to do?

First we have to retrieve

http://www.youtube.com/v/qg1ckCkm8YI

then follow the redirect (which WWW::Mechanize will do for us), then get the 't' identifier from the query string and finally request and save the output of

http://www.youtube.com/get_video?video_id=qg1ckCkm8YI&t=OEgsToPDskJ3bp4DEiMuxUmjx7oumUec&eurl=

That's it!

So here is the final version which can probably be made even shorter:

perl -MWWW::Mechanize -e '$_ = shift; s#http://|www\.|youtube\.com/|watch\?|v=|##g; $m = WWW::Mechanize->new; ($t = $m->get("http://www.youtube.com/v/$_")->request->uri) =~ s/.*&t=(.+)/$1/; $m->get("http://www.youtube.com/get_video?video_id=$_&t=$t", ":content_file" => "$_.flv")'

A little longer than a usual one-liner, but it does the job nicely. To keep it short, there is no error checking!

To use this one-liner, just copy it to the command line and specify the URL of a YouTube video (or just the ID of the video, or a variation of the URL, like one without 'http://'). Like this:

perl -MWWW::Mechanize -e '...' http://www.youtube.com/watch?v=l69Vi5IDc0g

or just

perl -MWWW::Mechanize -e '...' l69Vi5IDc0g

Let's spread this one-liner across multiple lines and see what it does, since it is not documented.
One could do the spreading by hand, but that's not what humans are for; let's make Perl do it. By adding -MO=Deparse to the command line we get the source code as Perl understands it (I added the line numbers myself):

use WWW::Mechanize;
1) $_ = shift @ARGV;
2) s[http://|www\.|youtube\.com/|watch\?|v=|][]g;
3) $m = 'WWW::Mechanize'->new;
4) ($t = $m->get("http://www.youtube.com/v/$_")->request->uri) =~ s/.*&t=(.+)/$1/;
5) $m->get("http://www.youtube.com/get_video?video_id=$_&t=$t", ':content_file', "$_.flv");

So our one-liner is actually 5 lines.
On line 1 we put the first argument from ARGV into the special variable $_, so we can take advantage of it and save some typing.
On line 2 we strip the URL down to just the ID of the video by removing parts of the URL one by one, so a user can specify the video URL in various formats like 'www.youtube.com/watch?v=ID', or just 'youtube.com/watch?v=ID', or just 'v=ID', or even just 'ID'. The ID stays in the special $_ variable.
On line 3 we create the WWW::Mechanize object we are going to use twice.
Line 4 needs more explanation because we are doing so much in it. First it retrieves the embedded video URL I talked about earlier; the server redirects us away, so we have to look at the URI of the last request. We save this URI into the variable $t and then extract the 't' identifier out of it.
As a YouTube video is uniquely specified by two IDs, the video ID and the 't' ID, on line 5 we retrieve the file and tell WWW::Mechanize to save its contents to the ID.flv file. WWW::Mechanize handles redirects for us, so everything should work. Indeed, I tested it out and it worked.
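
If you prefer a regular script to a one-liner, here is roughly the same logic written out as a short program with a bit of basic error checking added (this is just my own sketch of the one-liner above, not a separately maintained tool):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# hypothetical expanded version of the one-liner above
my $arg = shift or die "Usage: $0 <youtube url or video id>\n";

# strip the URL down to the bare video ID, as on line 2 of the one-liner
(my $id = $arg) =~ s{http://|www\.|youtube\.com/|watch\?|v=}{}g;

my $mech = WWW::Mechanize->new();

# request the embedded player URL; Mechanize follows the redirect for us,
# and the URL we end up at carries the 't' parameter in its query string
my $final_uri = $mech->get("http://www.youtube.com/v/$id")->request->uri;
my ($t) = $final_uri =~ /&t=(.+)/
    or die "Could not find the 't' parameter for video '$id'\n";

# request the video itself and save it to ID.flv
$mech->get("http://www.youtube.com/get_video?video_id=$id&t=$t",
           ":content_file" => "$id.flv");
print "Saved $id.flv\n";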

Can you golf it shorter? :)

I golfed it a little myself, here is what I came up with:

perl -MWWW::Mechanize -e '$_ = shift; ($y, $i) = m#(http://www\.youtube\.com)/watch\?v=(.+)#; $m = WWW::Mechanize->new; ($t = $m->get("$y/v/$i")->request->uri) =~ s/.*&t=(.+)/$1/; $m->get("$y/get_video?video_id=$i&t=$t", ":content_file" => "$i.flv")'

To use this one-liner you must specify the full URL of the YouTube video, like this one:

http://www.youtube.com/watch?v=l69Vi5IDc0g

This one-liner saves the "http://www.youtube.com" string in the variable $y and the ID of the video in the variable $i. The $y comes in handy because we don't have to write out the full YouTube URL; instead we just use $y.

Update 2009.12.05: YouTube has changed the way it displays videos several times! The current one-liner is here:

perl -MWWW::Mechanize -e '$m = WWW::Mechanize->new; $_=shift; ($i) = /v=(.+)/; s/%(..)/chr(hex($1))/ge for (($u) = $m->get($_)->content =~ /l_map": .+(?:%2C)?5%7C(.+?)"/); print $i, "\n"; $m->get($u, ":content_file" => "$i.flv")'
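
Here is my own annotated reading of that one-liner, spread out over several lines with comments added (the same code reformatted, not a B::Deparse dump):

use WWW::Mechanize;
$m = WWW::Mechanize->new;
$_ = shift;                                # the full watch URL
($i) = /v=(.+)/;                           # extract the video ID after 'v='
# pull the fmt 5 URL out of the "fmt_url_map" value on the watch page
(($u) = $m->get($_)->content =~ /l_map": .+(?:%2C)?5%7C(.+?)"/);
s/%(..)/chr(hex($1))/ge for $u;            # url-unescape it (%XX -> character)
print $i, "\n";                            # print the video ID
$m->get($u, ":content_file" => "$i.flv");  # download the video to ID.flv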


I love cheat sheets. I have at least 30 in front of me at the moment, scattered throughout my desk: an awk cheat sheet, a vt100/ansi screen terminal emulation keyboard mapping cheat sheet, an emacs command line editing mode cheat sheet, ethernet, ip and tcp header layout cheat sheets and many, many others. I am glad that I can share the cheat sheets with you so you can become a better developer! I have made some 15 of the 30 cheat sheets myself and I will make them available for download in future posts.

Someone might say that looking the information up in a man or info page, manual or Google is equivalent to looking it up on a cheat sheet, if not better. I disagree.

First of all, it usually, though not always, takes more time to find where in the documentation a particular thing is located, while on a cheat sheet you have it right in front of you. For example, suppose you were a C programmer and forgot how to print a floating point number; you'd type `man 3 printf' and quickly find the answer. That's fine - it's as fast or even faster than looking it up on a cheat sheet, but only because you knew exactly what you were looking for! But let's look at Joe Random, who has just begun learning the gawk programming language and at the moment wants to replace part of a string with another string. He'd probably already have the manual open and would start going through it looking for string functions. It would take him a good minute or so before he finds the right place and reads how the function works. Now, if he had a cheat sheet, like the one I created here, he'd have all the string functions in front of him and would quickly see that it is gsub or gensub. He'd find not only that but also all the other string functions, and next time he has a problem he might remember the right function subconsciously.

What I want to emphasize is that cheat sheets are not for NOT remembering things, looking them up hundreds of thousands of times and never actually learning them, but FOR remembering and learning new things faster.

It is also interesting to notice which technologies you have not used for a while, because those cheat sheets pile up at the bottom while the most recently used ones end up on top.

I made this cheat sheet in Microsoft Word because I am not that good with TeX, particularly at formatting data so that it looks nice.

So the awk cheat sheet is available in .doc and .pdf. I will usually post the cheat sheets in these two formats.

This cheat sheet contains:

  • Predefined Variable Summary, which lists all the predefined variables and which awk versions (original awk, nawk or gawk) have it built in,
  • GNU awk's command line argument summary,
  • I/O statements,
  • Numeric functions,
  • Bit manipulation functions,
  • I18N (internationalization) functions,
  • String functions, and finally,
  • Time functions.

If you notice any inaccuracies, mistakes or see that I have left something out, comment on this page, please!

Download AWK Cheat-Sheet

PDF:
Download link: awk cheat sheet (.pdf)
Downloaded: 173054 times

Plain Text (.txt):
Download link: awk cheat sheet (.txt)
Downloaded: 106254 times

Microsoft Word 2000 format (.doc):
Download link: awk cheat sheet (.doc)
Downloaded: 6756 times


Alright. Let's get things going on this blog. This is the first post and I am still getting the design right, so bear with me.

As I mentioned in the lengthy 'About this blog' post, one of the things I love to do is figuring out how to get something done with the wrong set of tools. I love it because it teaches me the "dark corners" of those tools.

This will be a tutorial on how to download YouTube videos. I just love watching YouTube videos. One of the latest videos I watched was this brilliant commercial. I also love programming, and one of the languages I learned recently was the awk programming language (the name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan). Would it be possible to download YouTube videos with awk?!

I do not want to go into the language details since that is not my goal today. If you want to learn this cool language check out this tutorial or these books.

The awk language does not originally have networking support, so without some networking tool that would create a network connection for us and pipe the contents from YouTube to awk, we would be out of luck. Also, awk is not well suited to handling binary data, so we will have to figure out how to read large amounts of binary data from the net in an efficient manner.

Let's find out what Google has to say about awk and networking. A quick search for 'awk + networking' gives us an interesting result - "TCP/IP Internetworking With `gawk'". Hey, wow! Just what we were looking for! Networking support for awk through special files in GNU's awk implementation!

Quoting the manual:

The special file name for network access is made up of several fields, all of which are mandatory:
/inet/protocol/localport/hostname/remoteport

Cool! We know that the web talks over the TCP protocol on port 80 and that we are accessing www.youtube.com for videos. So the special file for accessing the YouTube website would be:

/inet/tcp/0/www.youtube.com/80

(localport is 0 because we are a client)

Now let's test this out and get the banner of YouTube's web server by making a HEAD HTTP request and reading the response back. The following script will get the HEAD response from YouTube:

BEGIN {
    YouTube = "/inet/tcp/0/www.youtube.com/80"
    print "HEAD / HTTP/1.0\r\n\r\n" |& YouTube
    while ((YouTube |& getline) > 0)
        print $0
    close(YouTube)
}

I saved this script to the file youtube.head.awk and ran gawk from the command line on my Linux box:

pkrumins@graviton:~$ gawk -f youtube.head.awk
HTTP/1.1 200 OK
Date: Mon, 09 Jul 2007 21:41:59 GMT
Server: Apache
...
[truncated]

Yeah! It worked!

Now, let's find out how YouTube embeds videos on their site. We know that the video is played with a flash player, so the html code which embeds it must be present. Let's find it.
I'll go a little easy here so that less experienced users can learn something, too. Suppose we did not know how the flash player was embedded in the page. How could we find it?

One way would be to notice that the title of the video is 'The Wind' and then search for this string in the html source until we notice something like 'swf', which is the extension for flash files, or 'flash'.

The other way would be to use a better tool, like the Firefox browser's FireBug extension, and arrive at the correct place in the source instantly, without searching, by bringing up FireBug's console and inspecting the embedded flash movie.

After doing this we would find that YouTube videos are displayed on the page by calling this JavaScript function which generates the appropriate html:

SWFObject("/player2.swf<strong>?hl=en&video_id=2mTLO2F_ERY&l=123&t=OEgsToPDskK5DwdDH6isCsg5GtXyGpTN&soff=1&sk=sZLEcvwsRsajGmQF7OqwWAU</strong>"

Visiting this URL http://www.youtube.com/player2.swf?hl=en... loads the video player in full screen. Not quite what we want. We want just the video file that is being played in the video player. How does this flash player load the video? There are two ways to find out - use a network traffic analyzer like Wireshark (previously Ethereal), or disassemble their flash player using SoThink's SWF Decompiler (it's commercial; I don't know of a free alternative) to see the ActionScript which loads the movie. I hope to show how to find the video file URL using both of these methods in future posts.

UPDATE (2007.10.21): This is no longer true. Now YouTube gets videos by taking 'video_id' and 't' id from the following JavaScript object:

var swfArgs = {hl:'en',video_id:'xh_LmxEuFo8',l:'39',t:'OEgsToPDskKwChZS_16Tu1BqrD4fueoW',sk:'ZU0Zy4ggmf9MYx1oVLUcYAC'};

UPDATE (2008.03.01): This is no longer true. Now YouTube gets videos by taking 'video_id' and 't' id from the following JavaScript object:

var swfArgs = {"BASE_YT_URL": "http://youtube.com/", "video_id": "JJ51hx3wGgI", "l": 242, "sk": "sZLEcvwsRsajGmQF7OqwWAU", "t": "OEgsToPDskJfAwvlG0JDr8cO-HVq2RaB", "hl": "en", "plid": "AARHZ9SrFgUPvbFgAAAAcADYAAA", "e": "h", "tk": "KVRgpgeftCUWrYaeqpikCbNxXMXKmdUoGtfTNVkEouMjv1SwamY-Wg=="};

UPDATE (2009.08.25): This is also no longer true. Now YouTube gets videos by requesting it from one of the urls specified in 'fmt_url_map', which is located in the following JavaScript object:

var swfArgs = {"rv.2.thumbnailUrl": "http%3A%2F%2Fi4.ytimg.com%2Fvi%2FCSG807d3P-U%2Fdefault.jpg", "rv.7.length_seconds": "282", "rv.0.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DOF5T_7fDGgw", "rv.0.view_count": "2379471", "rv.2.title": "Banned+Commercials+-+Levis", "rv.7.thumbnailUrl": "http%3A%2F%2Fi3.ytimg.com%2Fvi%2FfbIdXn1zPbA%2Fdefault.jpg", "rv.4.rating": "4.87804878049", "length_seconds": "123", "rv.0.title": "Variety+Sex+%28LGBQT+Part+2%29", "rv.7.title": "Coke_Faithless", "rv.3.view_count": "2210628", "rv.5.title": "Three+sheets+to+the+wind%21", "rv.0.length_seconds": "364", "rv.4.thumbnailUrl": "http%3A%2F%2Fi3.ytimg.com%2Fvi%2F6IjUkNmUcHc%2Fdefault.jpg", "fmt_url_map": "18%7Chttp%3A%2F%2Fv22.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Cburst%252Cfactor%26itag%3D18%26ipbits%3D0%26signature%3D41B6B8B8FC0CF235443FC88E667A713A8A407AE7.CF9B5B68E39D488E61FE8B50D3BAEEF48A018A3C%26sver%3D3%26expire%3D1251270000%26key%3Dyt1%26factor%3D1.25%26burst%3D40%26id%3Dda64cb3b617f1116%2C34%7Chttp%3A%2F%2Fv19.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Cburst%252Cfactor%26itag%3D34%26ipbits%3D0%26signature%3DB6853342CDC97C85C83A872F9E5F274FE8B7B4A2.2B24E4836216C2F54428509388BC74043DB1782A%26sver%3D3%26expire%3D1251270000%26key%3Dyt1%26factor%3D1.25%26burst%3D40%26id%3Dda64cb3b617f1116%2C5%7Chttp%3A%2F%2Fv17.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Cburst%252Cfactor%26itag%3D5%26ipbits%3D0%26signature%3DB84AF2BE4ED222EC0217BA3149456F1164827F0C.1ECC42B7587411B734CC7B37209FDFA9A935391D%26sver%3D3%26expire%3D1251270000%26key%3Dyt1%26factor%3D1.25%26burst%3D40%26id%3Dda64cb3b617f1116", "rv.2.rating": "4.77608082707", "keywords": "the%2Cwind", "cr": "US", "rv.1.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dmp7g_8rEdg8", "rv.6.thumbnailUrl": "http%3A%2F%2Fi1.ytimg.com%2Fvi%2Fx-OqKWXirsU%2Fdefault.jpg", "rv.1.id": "mp7g_8rEdg8", "rv.3.rating": "4.14860864417", "rv.6.title": "best+commercial+ever", "rv.7.id": "fbIdXn1zPbA", "rv.4.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D6IjUkNmUcHc", "rv.1.title": "Quilmes+comercial", "rv.1.thumbnailUrl": "http%3A%2F%2Fi2.ytimg.com%2Fvi%2Fmp7g_8rEdg8%2Fdefault.jpg", "rv.3.title": "Viagra%21+Best+Commercial%21", "rv.0.rating": "3.79072164948", "watermark": "http%3A%2F%2Fs.ytimg.com%2Fyt%2Fswf%2Flogo-vfl106645.swf%2Chttp%3A%2F%2Fs.ytimg.com%2Fyt%2Fswf%2Fhdlogo-vfl100714.swf", "rv.6.author": "hbfriendsfan", "rv.5.id": "w0BQh-ICflg", "tk": "OK0E3bBTu64aAiJXYl2eScsjwe3ggPK1q1MXf7LPuwIFAjkL2itc1Q%3D%3D", "rv.4.author": "yaquijr", "rv.0.featured": "1", "rv.0.id": "OF5T_7fDGgw", "rv.3.length_seconds": "30", "rv.5.rating": "4.42047930283", "rv.1.view_count": "249202", "sdetail": "p%3Awww.catonmat.net%2Fblog%2Fdownload", "rv.1.author": "yodroopy", "rv.1.rating": "3.66379310345", "rv.4.title": "epuron+-+the+power+of+wind", "rv.5.thumbnailUrl": "http%3A%2F%2Fi4.ytimg.com%2Fvi%2Fw0BQh-ICflg%2Fdefault.jpg", "rv.5.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dw0BQh-ICflg", "rv.6.length_seconds": "40", "sourceid": "r", "rv.0.author": "kicesie", "rv.3.thumbnailUrl": "http%3A%2F%2Fi4.ytimg.com%2Fvi%2FKShkhIXdf1Y%2Fdefault.jpg", "rv.2.author": "dejerks", "rv.6.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dx-OqKWXirsU", "rv.7.rating": "4.51851851852", "rv.3.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DKShkhIXdf1Y", "fmt_map": 
"18%2F512000%2F9%2F0%2F115%2C34%2F0%2F9%2F0%2F115%2C5%2F0%2F7%2F0%2F0", "hl": "en", "rv.7.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DfbIdXn1zPbA", "rv.2.view_count": "9744415", "rv.4.length_seconds": "122", "rv.4.view_count": "162653", "rv.2.url": "http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DCSG807d3P-U", "plid": "AARyAMgw_jlMzIA7", "rv.5.length_seconds": "288", "rv.0.thumbnailUrl": "http%3A%2F%2Fi4.ytimg.com%2Fvi%2FOF5T_7fDGgw%2Fdefault.jpg", "rv.7.author": "paranoidus", "sk": "I9SvaNetkP1IR2k_kqJzYpB_ItoGOd2GC", "rv.5.view_count": "503035", "rv.1.length_seconds": "61", "rv.6.rating": "4.74616639478", "rv.5.author": "hotforwords", "vq": "None", "rv.3.id": "KShkhIXdf1Y", "rv.2.id": "CSG807d3P-U", "rv.2.length_seconds": "60", "t": "vjVQa1PpcFOeKDyjuF7uICOYYpHLyjaGXsro1Tsfao8%3D", "rv.6.id": "x-OqKWXirsU", "video_id": "2mTLO2F_ERY", "rv.6.view_count": "2778674", "rv.3.author": "stephancelmare360", "rv.4.id": "6IjUkNmUcHc", "rv.7.view_count": "4260"};

We need to extract these two IDs and make a request string like '?video_id=xh_LmxEuFo8&t=OEgsToPDskKwChZS_16Tu1BqrD4fueoW'. The rest of the article describes the old way YouTube handled videos (before these updates), but the idea is basically the same.

For now, I can tell you that once the video player loads, it gets the FLV (flash video) file from:

http://www.youtube.com/get_video?hl=en&video_id=2mTLO2F_ERY&l=123&t=OEgsToPDskK5DwdDH6isCsg5GtXyGpTN&soff=1&sk=sZLEcvwsRsajGmQF7OqwWAU

Where the query string after 'http://www.youtube.com/get_video' is the same as the one in the previous fragment after player2.swf.

If you now entered this URL into a browser, it should pop up the download dialog and you should be able to save the flash movie to your computer. But it's not that easy! YouTube actually 302-redirects you to one or two other URLs before the video download actually starts! So we will have to handle these HTTP redirects ourselves in the awk script, because awk does not know anything about the HTTP protocol!

So basically all we have to do is construct an awk script which finds the request string (the query string shown above), appends it to 'http://www.youtube.com/get_video', handles the 302 redirects and finally saves the video data to a file.

Since awk already has great pattern matching built in, we can extract the request string by getting the html source of the video page, searching for the line which contains SWFObject("/player2.swf and extracting everything after the '?' up to the closing '"'.

So here is the final script. Save it to a file named 'get_youtube_vids.awk' and it can then be used from the command line as follows:

gawk -f get_youtube_vids.awk <http://www.youtube.com/watch?v=ID1> [http://youtube.com/watch?v=ID2 | ID2] ...

For example, to download the commercial which I said was great, you'd call the script like this:

gawk -f get_youtube_vids.awk http://www.youtube.com/watch?v=2mTLO2F_ERY

or just using the ID of the video:

gawk -f get_youtube_vids.awk 2mTLO2F_ERY

Here is the source code of the program:

#!/usr/bin/gawk -f
#
# 2007.07.10 v1.0 - initial release
# 2007.10.21 v1.1 - youtube changed the way it displays vids
# 2008.03.01 v1.2 - youtube changed the way it displays vids
# 2008.08.28 v1.3 - added a progress bar and removed need for --re-interval 
# 2009.08.25 v1.4 - youtube changed the way it displays vids
#
# Peteris Krumins (peter@catonmat.net)
# http://www.catonmat.net -- good coders code, great reuse
#
# Usage: gawk -f get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ...
# or just ./get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1>
#

BEGIN {
    if (ARGC == 1) usage();

    BINMODE = 3

    delete ARGV[0]
    print "Parsing YouTube video urls/IDs..."
    for (i in ARGV) {
        vid_id = parse_url(ARGV[i])
        if (length(vid_id) < 6) { # haven't seen youtube vids with IDs < 6 chars
            print "Invalid YouTube video specified: " ARGV[i] ", not downloading!"
            continue
        }
        VIDS[i] = vid_id
    }

    for (i in VIDS) {
        print "Getting video information for video: " VIDS[i] "..."
        get_vid_info(VIDS[i], INFO)

        if (INFO["_redirected"]) {
            print "Could not get video info for video: " VIDS[i]
            continue 
        }

        if (!INFO["video_url"]) {
            print "Could not get video_url for video: " VIDS[i]
            print "Please goto my website, and submit a comment with an URL to this video, so that I can fix it!"
            print "Url: http://www.catonmat.net/blog/downloading-youtube-videos-with-gawk/"
            continue
        }
        if ("title" in INFO) {
            print "Downloading: " INFO["title"] "..."
            title = INFO["title"]
        }
        else {
            print "Could not get title for video: " VIDS[i]
            print "Trying to download " VIDS[i] " anyway"
            title = VIDS[i]
        }
        download_video(INFO["video_url"], title)
    }
}

function usage() {
    print "Downloading YouTube Videos with GNU Awk"
    print
    print "Peteris Krumins (peter@catonmat.net)"
    print "http://www.catonmat.net  --  good coders code, great reuse"
    print 
    print "Usage: gawk -f get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ..."
    print "or just ./get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ..."
    exit 1
}

#
# function parse_url
#
# takes a url or an ID of a youtube video and returns just the ID
# for example the url could be the full url: http://www.youtube.com/watch?v=ID
# or it could be www.youtube.com/watch?v=ID
# or just youtube.com/watch?v=ID or http://youtube.com/watch?v=ID
# or just the ID
#
function parse_url(url) {
    gsub(/http:\/\//, "", url)                # get rid of http:// part
    gsub(/www\./,     "", url)                # get rid of www.    part
    gsub(/youtube\.com\/watch\?v=/, "", url)  # get rid of youtube.com... part

    if ((p = index(url, "&")) > 0)      # get rid of &foo=bar&... after the ID
        url = substr(url, 1, p-1)

    return url
}

#
# function get_vid_info
#
# function takes the youtube video ID and gets the title of the video
# and the url to .flv file
#
function get_vid_info(vid_id, INFO,    InetFile, Request, HEADERS, matches, escaped_urls, fmt_urls, fmt) {
    delete INFO
    InetFile = "/inet/tcp/0/www.youtube.com/80"
    Request = "GET /watch?v=" vid_id " HTTP/1.1\r\n"
    Request = Request "Host: www.youtube.com\r\n\r\n"

    get_headers(InetFile, Request, HEADERS)
    if ("Location" in HEADERS) {
        INFO["_redirected"] = 1
        close(InetFile)
        return
    }

    while ((InetFile |& getline) > 0) {
        if (match($0, /"fmt_url_map": "([^"]+)"/, matches)) {
            escaped_urls = url_unescape(matches[1])
            split(escaped_urls, fmt_urls, /,?[0-9]+\|/)
            for (fmt in fmt_urls) {
                if (fmt_urls[fmt] ~ /itag=5/) {
                    # fmt number 5 is the best video
                    INFO["video_url"] = fmt_urls[fmt]
                    close(InetFile)
                    return
                }
            }
            close(InetFile)
            return
        }
        else if (match($0, /<title>YouTube - ([^<]+)</, matches)) {
            # lets try to get the title of the video from html tag which is
            # less likely a subject to future html design changes
            INFO["title"] = matches[1]
        }
    }
    close(InetFile)
}

#
# function url_unescape
#
# given a string, it url-unescapes it.
# characters such as %20 get converted to their ascii counterparts.
#
function url_unescape(str,    nmatches, entity, entities, seen, i) {
    nmatches = find_all_matches(str, "%[0-9A-Fa-f][0-9A-Fa-f]", entities)
    for (i = 1; i <= nmatches; i++) {
        entity = entities[i]
        if (!seen[entity]) {
            if (entity == "%26") { # special case for gsub(s, r, t), when r = '&'
                gsub(entity, "\\&", str)
            }
            else {
                gsub(entity, url_entity_unescape(entity), str)
            }
            seen[entity] = 1
        }
    }
    return str
}

#
# function find_all_matches
#
# http://awk.freeshell.org/FindAllMatches
#
function find_all_matches(str, re, arr,    j, a, b) {
    j=0
    a = RSTART; b = RLENGTH   # to avoid unexpected side effects

    while (match(str, re) > 0) {
        arr[++j] = substr(str, RSTART, RLENGTH)
        str = substr(str, RSTART+RLENGTH)
    }
    RSTART = a; RLENGTH = b
    return j
}

#
# function url_entity_unescape
#
# given an url-escaped entity, such as %20, return its ascii counterpart.
#
function url_entity_unescape(entity) {
    sub("%", "", entity)
    return sprintf("%c", strtonum("0x" entity))
}

#
# function download_video
#
# takes the url to video and saves the movie to current directory using
# sanitized video title as the filename
#
function download_video(url, title,    filename, InetFile, Request, Loop, HEADERS, FOO) {
    title = sanitize_title(title)
    filename = create_filename(title)

    parse_location(url, FOO)
    InetFile = FOO["InetFile"]
    Request  = "GET " FOO["Request"] " HTTP/1.1\r\n"
    Request  = Request "Host: " FOO["Host"] "\r\n\r\n"

    Loop = 0 # make sure we do not get caught in Location: loop
    do {     # we can get more than one redirect, follow them all
        get_headers(InetFile, Request, HEADERS)
        if ("Location" in HEADERS) { # we got redirected, let's follow the link
            close(InetFile)
            parse_location(HEADERS["Location"], FOO)
            InetFile = FOO["InetFile"]
            Request  = "GET " FOO["Request"] " HTTP/1.1\r\n"
            Request  = Request "Host: " FOO["Host"] "\r\n\r\n"
            if (InetFile == "") {
                print "Downloading '" title "' failed, couldn't parse Location header!"
                return
            }
        }
        Loop++
    } while (("Location" in HEADERS) && Loop < 5)

    if (Loop == 5) {
        print "Downloading '" title "' failed, got caught in Location loop!"
        return
    }
    
    print "Saving video to file '" filename "' (size: " bytes_to_human(HEADERS["Content-Length"]) ")..."
    save_file(InetFile, filename, HEADERS)
    close(InetFile)
    print "Successfully downloaded '" title "'!"
}

#
# function sanitize_title
#
# sanitizes the video title, by removing ()'s, replacing spaces with _, etc.
# 
function sanitize_title(title) {
    gsub(/\(|\)/, "", title)
    gsub(/[^[:alnum:]-]/, "_", title)
    gsub(/_-/, "-", title)
    gsub(/-_/, "-", title)
    gsub(/_$/, "", title)
    gsub(/-$/, "", title)
    gsub(/_{2,}/, "_", title)
    gsub(/-{2,}/, "-", title)
    return title
}

#
# function create_filename
#
# given a sanitized video title, creates a nonexisting filename
#
function create_filename(title,    filename, i) {
    filename = title ".flv"
    i = 1
    while (file_exists(filename)) {
        filename = title "-" i ".flv"
        i++
    }
    return filename
}

#
# function save_file
#
# given a special network file and filename reads from network until eof
# and saves the read contents into a file named filename
#
function save_file(Inet, filename, HEADERS,    done, cl, perc, hd, hcl) {
    OLD_RS  = RS
    OLD_ORS = ORS

    ORS = ""

    # clear the file
    print "" > filename

    # here we will do a little hackery to write the downloaded data
    # to file chunk by chunk instead of downloading it all to memory
    # and then writing
    #
    # the idea is to use a regex for the record separator
    # everything that gets matched is stored in the RT variable
    # which gets written to disk after each match
    #
    # RS = ".{1,512}" # let's read 512 byte records

    RS = "@" # I replaced the 512 block reading with something better.
             # To read blocks I had to force users to specify --re-interval,
             # which made them uncomfortable.
             # I did statistical analysis on YouTube video files and
             # I found that hex value 0x40 appears pretty often (200 bytes or so)!
             #

    cl = HEADERS["Content-Length"]
    hcl = bytes_to_human(cl)
    done = 0
    while ((Inet |& getline) > 0) {
        done += length($0 RT)
        perc = done*100/cl
        hd = bytes_to_human(done)
        printf "Done: %d/%d bytes (%d%%, %s/%s)            \r",
            done, cl, perc, bytes_to_human(done), bytes_to_human(cl)
        print $0 RT >> filename
    }
    printf "Done: %d/%d bytes (%d%%, %s/%s)            \n",
        done, cl, perc, bytes_to_human(done), bytes_to_human(cl)

    RS  = OLD_RS
    ORS = OLD_ORS
}

#
# function get_headers
#
# given a special inet file and the request saves headers in HEADERS array
# special key "_status" can be used to find HTTP response code
# issuing another getline() on inet file would start returning the contents
#
function get_headers(Inet, Request,    HEADERS, matches, OLD_RS) {
    delete HEADERS

    # save global vars
    OLD_RS=RS

    print Request |& Inet

    # get the http status response
    if ((Inet |& getline) > 0) {
        HEADERS["_status"] = $2
    }
    else {
        print "Failed reading from the net. Quitting!"
        exit 1
    }

    RS="\r\n"
    while ((Inet |& getline) > 0) {
        # we could have used FS=": " to split, but i could not think of a good
        # way to handle header values which contain multiple ": "
        # so i better go with a match
        if (match($0, /([^:]+): (.+)/, matches)) {
            HEADERS[matches[1]] = matches[2]
        }
        else { break }
    }
    RS=OLD_RS
}

#
# function parse_location
#
# given a Location HTTP header value the function constructs a special
# inet file and the request storing them in FOO
#
function parse_location(location, FOO) {
    # location might look like http://cache.googlevideo.com/get_video?video_id=ID
    if (match(location, /http:\/\/([^\/]+)(\/.+)/, matches)) {
        FOO["InetFile"] = "/inet/tcp/0/" matches[1] "/80"
        FOO["Host"]     = matches[1]
        FOO["Request"]  = matches[2]
    }
    else {
        FOO["InetFile"] = ""
        FOO["Host"]     = ""
        FOO["Request"]  = ""
    }
}

# function bytes_to_human
#
# given bytes, converts them to human readable format like 13.2mb
#
function bytes_to_human(bytes,    MAP, map_idx, bytes_copy) {
    MAP[0] = "b"
    MAP[1] = "kb"
    MAP[2] = "mb"
    MAP[3] = "gb"
    MAP[4] = "tb"
   
    map_idx = 0
    bytes_copy = int(bytes)
    while (bytes_copy > 1024) {
        bytes_copy /= 1024
        map_idx++
    }

    if (map_idx > 4)
        return sprintf("%d bytes", bytes, MAP[map_idx])
    else
        return sprintf("%.02f%s", bytes_copy, MAP[map_idx])
}

#
# function file_exists
#
# given a path to file, returns 1 if the file exists, or 0 if it doesn't
#
function file_exists(file,    foo) {
    if ((getline foo <file) >= 0) {
        close(file)
        return 1
    }
    return 0
}

Each function is well documented, so the code should be easy to understand. If you see something that can be improved or optimized, just comment on this page. Also, if you would like me to explain any fragment of the source code in even more detail, let me know.

The most interesting function in this script is save_file, which does chunked downloading in a hacky way (see the comments in the source to see how it works).

Download

Download link: gawk youtube video downloader
Total downloads: 26693 times



This is an old post. Most of the things written here are no longer true or I don't care about them. I left it unedited for historic purposes, so I can return after 10 years and see what I was up to back when I wrote it.

Welcome to my blog, dear reader.

I feel great that I have finally launched this blog, because launching it has been my dream for two years, ever since I bought this domain, catonmat.net. Hackers love cats, don't they? lolcatz. ;)

So why did I start this blog and what am I going to write here?

First of all, let me introduce myself a little (more information about me is on the about page).

My name is Peteris Krumins (pronounced as Peter-is Kroo-mins) which directly translates to English as Peter Bush. (Hi, George!)

I am 22 years old and I am finishing my physics degree next year. There is a good reason why I did not choose computer science as my major. By the time I finished high school I already had great work experience as a programmer, Linux sysadmin and a white hat (computer security). After I get a B.Sc. in physics I am going for an M.Sc. degree in computer science, hopefully at MIT. I find a Computer Science M.Sc. programme much more challenging than a B.Sc. programme and consider it worth spending time on. I am applying to MIT this autumn. MIT has been my dream for a few years now, ever since I found MIT's OCW video lectures and saw how cool they are. (A little self-promotion: I loved the video lectures so much that I started a free video lecture blog and a free science video clip website, which have become pretty popular.)

So I have a really good understanding of programming and have found two great approaches to software development. These approaches seem to be pretty well known, and someone might say that they are wrong, but I am going to show that they are not wrong and are efficient.

One of them I call "the hacker's approach", which basically means you just create cool stuff quickly, that works, using anything you have available. And if something doesn't work, you just get it working ASAP without going into much detail about why it didn't work. It might sound pretty lame, but I am going to demonstrate in some of my posts that you can quickly create cool tools and software with little effort.

The other approach is what the title of this blog says: "good coders code, great reuse". This approach is similar to "the hacker's approach" but it does not require that you create your software quickly and dirtily. It just means you reuse existing libraries and spend little effort writing stuff that has already been written. Software reuse, so to say. There are so many libraries and so much code available on SourceForge, Google Code, Freshmeat and many other sites that solutions to most problems are already there.

By using these two approaches I will create some great free software and will try to monetize it so I can make enough money for MIT :)

I have been reading about Internet marketing for the last year, so I will definitely do a few software projects from A to Z to learn more about software marketing and to teach you how it can be done. I will create open source software and document each step on this blog - what I learned, how I did it, how it worked and how I marketed it.

Also, I have a list of 50 or so cool software ideas that I have thought of during the last 3 years while studying physics. I have not coded them myself, but I will theorize on how to implement them, actually implement some of them, and write about what alternatives we have for them at the moment. For example, English is not my native language, so I sometimes struggle to get the grammar right. When in doubt, I usually query Google with the grammar form in question to find which form returns more hits. The one with the most hits probably has a higher chance of being correct. So I have an idea of writing a tool which checks grammar or expressions in any language based on how many hits Google returns. Or another example: there is this viemu software which emulates vi key bindings in Visual Studio; how hard would it be to embed vim in Visual Studio? I don't think it's that hard - probably some days of hacking to understand the Visual Studio SDK and you get a fully fledged vim in Visual Studio. You get the idea.

Some of my posts will be about how to do things with the wrong tools. A topic which I really love myself. It could be considered a hacker's approach, because a real hacker would need to figure something out to accomplish the job if he did not have the right tools. The usual scenario for doing something with a wrong tool is choosing some tool and setting a goal to do something with it. For example, I was recently learning the AWK programming language for fun, and after a few hours of hacking I decided to watch a YouTube video just for relaxation. I love watching YouTube videos. Reasonably, I would like to download some of my favorite videos so I could view them on my laptop or phone. I thought, wouldn't it be great if I could download the videos with AWK? A quick Google search for awk + networking brought up results with the GNU implementation of AWK - GAWK. And this implementation has networking support. So after a few tens of minutes I found a way to download my favorite YouTube videos with AWK. Pretty neat, huh? :) (I already wrote this tool, here is the post - downloading youtube videos with gawk)

Alright, I mentioned I had worked as a white hat. A really cool job - you hack and get paid for it. During my practice I found that many boxes were "secured" through obscurity. For example, having an internal web server (for outside access) forwarded on a port like 31234. Surely, this is no security. So I came up with this idea of hacking through obscurity. I just had to name regular hacking in an interesting way. I will write about it. It's similar to the idea of doing things with wrong tools. Here is an example: suppose you hacked a box and wanted to make sure your rootkit never got deleted. One step would be to put the script which installs the rootkit in the rc scripts, but that's easily detectable because it's the first place to look for things like that. What about putting the script in a less known place, like the bash shell's PROMPT_COMMAND variable, whose contents get evaluated before each prompt is displayed? I call this hacking through obscurity. This is, of course, also easily detectable, but you learn something cool that not many other people know. Kinda cool. Heh. Heh :)

And one of the final things I am going to post are programming cheat sheets. I love cheat sheets. I have at least 30 on my desk, of which I have made 15 myself. Stuff like Perl's predefined variables, C and C++ operator precedence and associativity tables and many others. (Here is the first post about cheat sheets - awk cheat sheet)

I LOVE GOOGLE. I LOVE GOOGLE. I LOVE GOOGLE. I LOVE GOOGLE.
Yes, I love Google! Google is everything for me. I love how smart they are, I love their products, I love their technology, I love that I can make money from them, I love that I can advertise with them, I love their tech-talk videos, I love their geek jokes, I love that they are the best company to work for, and I love absolutely everything about them!

Thanks for reading the first post! Don't forget to make first comment! :)