Alright. Let’s get things going on this blog. This is the first post and I am still getting the design right so bear with me.
As I mentioned in the lengthy ‘About this blog’ post, one of the things I love to do is figure out how to get something done with the wrong set of tools. I love it because it teaches me the “dark corners” of those tools.
This will be a tutorial on how to download YouTube videos. I just love watching YouTube videos. One of the latest videos I watched was this brilliant commercial. Also I love programming and one of the languages I learned recently was the awk (the name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan) programming language. Would it be possible to download YouTube videos with awk?!
I do not want to go into the language details since that is not my goal today. If you want to learn this cool language check out this tutorial or these books.
Awk originally had no networking support, so without some external networking tool to open a connection and pipe the contents from YouTube to awk we would be out of luck. Awk is also not well suited to handling binary data, so we will have to figure out how to read large amounts of binary data from the net efficiently.
Let’s find out what Google has to say about awk and networking. A quick search for ‘awk + networking’ gives us an interesting result: “TCP/IP Internetworking With `gawk’”. Hey, wow! Just what we were looking for! GNU’s awk implementation supports networking through special files!
Quoting the manual:
The special file name for network access is made up of several fields, all of which are mandatory:
/inet/protocol/localport/hostname/remoteport
Cool! We know that the web speaks HTTP over TCP, on port 80 by default, and that we are accessing www.youtube.com for videos. So the special file for accessing the YouTube website would be:
/inet/tcp/0/www.youtube.com/80
(localport is 0 because we are a client)
Now let’s test this out and get the banner of YouTube’s web server by making an HTTP HEAD request and reading the response back. The following script gets the HEAD response from YouTube:
BEGIN {
    YouTube = "/inet/tcp/0/www.youtube.com/80"
    print "HEAD / HTTP/1.0\r\n\r\n" |& YouTube
    while ((YouTube |& getline) > 0)
        print $0
    close(YouTube)
}
I saved this script to a file named youtube.head.awk and ran gawk from the command line on my Linux box:
pkrumins@graviton:~$ gawk -f youtube.head.awk
HTTP/1.1 200 OK
Date: Mon, 09 Jul 2007 21:41:59 GMT
Server: Apache
... [truncated]
Yeah! It worked!
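To double-check what came back, a response like the one above can be picked apart with nothing but split and match. This is just a sketch: the response text is hard-coded from the output above instead of being read from the network, and the variable names are my own.

```awk
# Sketch: split an HTTP response head into the status code and the headers.
# The response string is a hard-coded sample, not live data.
BEGIN {
    response = "HTTP/1.1 200 OK\r\n" \
               "Date: Mon, 09 Jul 2007 21:41:59 GMT\r\n" \
               "Server: Apache\r\n\r\n"

    # HTTP separates header lines with CRLF
    n = split(response, lines, /\r\n/)

    # the status line looks like: HTTP/1.1 200 OK
    split(lines[1], status, " ")
    print "status code: " status[2]

    # the headers run until the first empty line
    for (i = 2; i <= n && lines[i] != ""; i++)
        if (match(lines[i], /^([^:]+): (.*)$/, m))   # gawk-only 3rd argument
            print m[1] " => " m[2]
}
```

Matching on the first ": " keeps header values that contain colons themselves (like the time in the Date header) in one piece.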
Now, let’s find out how YouTube embeds videos on their site. We know that the video is played with a flash player so html code which displays it must be present. Let’s find it.
I’ll go a little easy here so the users with less experience can learn something, too. Suppose we did not know how the flash was embedded in the page. How could we find it?
One way would be to notice that the title of the video is ‘The Wind’ and then search for this string in the html source until we notice something like ‘swf’, which is the extension for flash files, or ‘flash’.
The other way would be to use a better tool like the Firefox browser’s Firebug extension and arrive at the correct place in the source instantly, without searching, by bringing up Firebug’s console and inspecting the embedded flash movie.
After doing this we would find that YouTube videos are displayed on the page by calling this JavaScript function which generates the appropriate html:
SWFObject("/player2.swf?hl=en&video_id=2mTLO2F_ERY&l=123&t=OEgsToPDskK5DwdDH6isCsg5GtXyGpTN&soff=1&sk=sZLEcvwsRsajGmQF7OqwWAU"
Visiting this URL http://www.youtube.com/player2.swf?hl=en… loads the video player in full screen. Not quite what we want. We want just the video file that is being played in the video player. How does this flash player load the video? There are two ways to find out: use a network traffic analyzer like Wireshark (previously Ethereal), or disassemble their flash player with SoThink’s SWF Decompiler (it’s commercial; I don’t know of a free alternative) to see the ActionScript which loads the movie. I hope to show how to find the video file url using both of these methods in future posts.
UPDATE (2007.10.21): This is no longer true. Now YouTube gets videos by taking the ‘video_id’ and ‘t’ ids from the following JavaScript object:
var swfArgs = {hl:'en',video_id:'xh_LmxEuFo8',l:'39',t:'OEgsToPDskKwChZS_16Tu1BqrD4fueoW',sk:'ZU0Zy4ggmf9MYx1oVLUcYAC'};
UPDATE (2008.03.01): The format changed again. Now YouTube gets videos by taking the ‘video_id’ and ‘t’ ids from the following JavaScript object:
var swfArgs = {"BASE_YT_URL": "http://youtube.com/", "video_id": "JJ51hx3wGgI", "l": 242, "sk": "sZLEcvwsRsajGmQF7OqwWAU", "t": "OEgsToPDskJfAwvlG0JDr8cO-HVq2RaB", "hl": "en", "plid": "AARHZ9SrFgUPvbFgAAAAcADYAAA", "e": "h", "tk": "KVRgpgeftCUWrYaeqpikCbNxXMXKmdUoGtfTNVkEouMjv1SwamY-Wg=="};
We need to extract these two ids and make a request string such as ‘?video_id=xh_LmxEuFo8&t=OEgsToPDskKwChZS_16Tu1BqrD4fueoW’ (the ids here come from the 2007.10.21 object). The rest of the article describes the old way YouTube handled videos (before the updates), but the approach is basically the same.
For now, I can tell you that once the video player loads it gets the FLV (Flash video) file from:
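As a rough sketch of that extraction, here is how the two ids can be pulled out of a swfArgs line with gawk’s three-argument match(). The sample line is a shortened copy of the 2008 object quoted above, and the program works on that hard-coded string instead of a real page.

```awk
# Sketch: extract video_id and t from a swfArgs line and build the request string.
# The line below is a shortened copy of the 2008.03.01 object from the post.
BEGIN {
    line = "var swfArgs = {\"BASE_YT_URL\": \"http://youtube.com/\", " \
           "\"video_id\": \"JJ51hx3wGgI\", \"l\": 242, " \
           "\"t\": \"OEgsToPDskJfAwvlG0JDr8cO-HVq2RaB\", \"hl\": \"en\"};"

    # gawk extension: the third argument of match() receives the captured groups
    if (match(line, /"video_id": "([^"]+)".+"t": "([^"]+)"/, m))
        print "?video_id=" m[1] "&t=" m[2]
    else
        print "no swfArgs found"
}
```

For the sample line this prints ?video_id=JJ51hx3wGgI&t=OEgsToPDskJfAwvlG0JDr8cO-HVq2RaB. The regex relies on video_id appearing before t on the same line, which holds for the 2008 object but would not match the 2007 one, where the keys are unquoted.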
http://www.youtube.com/get_video?hl=en&video_id=2mTLO2F_ERY&l=123&t=OEgsToPDskK5DwdDH6isCsg5GtXyGpTN&soff=1&sk=sZLEcvwsRsajGmQF7OqwWAU
The query string after ‘http://www.youtube.com/get_video’ (everything from the ? onwards) is the same one that follows player2.swf in the previous fragment.
If you entered this url into a browser now, it should pop up the download dialog and you should be able to save the flash movie to your computer. But it’s not that easy! YouTube actually sends 302 redirects to one or two other urls before the video download starts! So we will have to handle these HTTP redirects in our awk script, because awk does not know anything about the HTTP protocol!
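Here is a minimal sketch of the redirect handling idea: take the value of a Location header and turn it into a gawk special file plus the next request to send. The helper name split_location is made up for illustration, the sample url only shows what a redirect target might look like, and the sketch assumes a plain http:// url without an explicit port.

```awk
# Sketch: turn a Location header value into a special file and a follow-up request.
# split_location is a hypothetical helper name; nothing is sent over the network.
function split_location(location, parts,    m) {
    delete parts
    if (match(location, /^http:\/\/([^\/]+)(\/.*)?$/, m)) {
        parts["host"] = m[1]
        parts["path"] = (m[2] == "") ? "/" : m[2]       # bare host means "/"
        parts["inet"] = "/inet/tcp/0/" m[1] "/80"       # assumes default port 80
        return 1
    }
    return 0    # not an http:// url, the caller should give up
}

BEGIN {
    location = "http://cache.googlevideo.com/get_video?video_id=ID"
    if (split_location(location, p)) {
        print "connect to: " p["inet"]
        # HTTP/1.1 requires the Host header on every request
        printf "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n", p["path"], p["host"]
    }
    else
        print "could not parse Location: " location
}
```

Following a redirect then boils down to closing the old connection, opening the new special file and repeating the request, with a counter to bail out if the redirects never end.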
So basically all we have to do is construct an awk script which finds the request string (the query string described above), appends it to ‘http://www.youtube.com/get_video’, handles the 302 redirects and finally saves the video data to a file.
Since awk has great pattern matching built in, we can extract the request string by getting the html source of the video page, searching for a line which contains SWFObject("/player2.swf and extracting everything after the ? up to the closing ".
So here is the final script. Copy or save it to a file named ‘get_youtube_vids.awk’ and then it can be used from the command line as follows:
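A small sketch of that old-style extraction, run against the SWFObject fragment quoted earlier instead of a live page:

```awk
# Sketch: pull the request string out of an old-style SWFObject(...) line.
# The line is the fragment quoted earlier in the post.
BEGIN {
    line = "SWFObject(\"/player2.swf?hl=en&video_id=2mTLO2F_ERY&l=123" \
           "&t=OEgsToPDskK5DwdDH6isCsg5GtXyGpTN&soff=1&sk=sZLEcvwsRsajGmQF7OqwWAU\""

    # capture everything between the ? and the closing double quote
    if (match(line, /SWFObject\("\/player2\.swf\?([^"]+)"/, m))
        print "/get_video?" m[1]
    else
        print "no SWFObject call on this line"
}
```

The printed path is exactly what gets requested from www.youtube.com in the next step.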
gawk --re-interval -f get_youtube_vids.awk <http://www.youtube.com/watch?v=ID1> [http://youtube.com/watch?v=ID2 | ID2] ...
For example, to download the video commercial I mentioned earlier, you’d call the script as:
gawk --re-interval -f get_youtube_vids.awk http://www.youtube.com/watch?v=2mTLO2F_ERY
or just using the ID of the video:
gawk --re-interval -f get_youtube_vids.awk 2mTLO2F_ERY
The --re-interval option is needed because I use an interval expression in a regex for doing chunked i/o when saving the video to disk in the save_file function.
#!/usr/bin/gawk -f
#
# 2007.07.10 v1.0 - initial release
# 2007.10.21 v1.1 - youtube changed the way it displays vids
# 2008.03.01 v1.2 - youtube changed the way it displays vids
#
# Peteris Krumins (peter@catonmat.net)
# http://www.catonmat.net - good coders code, great reuse
#
# Usage: gawk --re-interval -f get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ...
#
BEGIN {
    if (ARGC == 1) usage();
    if ("fooooo" !~ "o{5}") {
        print "Error: --re-interval option was not specified!"
        print
        usage();
    }
    BINMODE = 3
    delete ARGV[0]
    print "Parsing YouTube video urls/IDs"
    for (i in ARGV) {
        vid_id = parse_url(ARGV[i])
        if (length(vid_id) < 6) { # haven't seen youtube vids with IDs < 6 chars
            print "Invalid YouTube video specified: '" ARGV[i] "', not downloading"
            continue
        }
        VIDS[i] = vid_id
    }
    for (i in VIDS) {
        get_vid_info(VIDS[i], INFO)
        if (!INFO["request"]) {
            print "Could not get request string for " VIDS[i]
            continue
        }
        if ("title" in INFO) {
            print "Downloading " INFO["title"]
            title = INFO["title"]
        }
        else {
            print "Could not get title for " VIDS[i]
            print "Trying to download " VIDS[i] " anyway"
            title = VIDS[i]
        }
        download_video(INFO["request"], title)
    }
}
function usage() {
    print "Downloading YouTube videos with gawk"
    print "http://www.catonmat.net - good coders code, great reuse"
    print
    print "Usage: gawk --re-interval -f get_youtube_vids.awk <http://youtube.com/watch?v=ID1 | ID1> ..."
    exit 1
}
#
# function parse_url
#
# takes a url or an ID of a youtube video and returns just the ID
# for example the url could be the full url: http://www.youtube.com/watch?v=ID
# or it could be www.youtube.com/watch?v=ID
# or just youtube.com/watch?v=ID or http://youtube.com/watch?v=ID
# or just the ID
#
function parse_url(url) {
    gsub(/http:\/\//, "", url)                # get rid of http:// part
    gsub(/www\./, "", url)                    # get rid of www. part
    gsub(/youtube\.com\/watch\?v=/, "", url)  # get rid of youtube.com... part
    if ((p = index(url, "&")) > 0)            # get rid of &foo=bar&... after the ID
        url = substr(url, 1, p - 1)           # p - 1 so that the & itself is dropped too
    return url
}
#
# function get_vid_info
#
# function takes the youtube video ID and gets the title of the video
# and request string to .flv video file
#
function get_vid_info(vid_id, INFO) {
    delete INFO # forget the title/request of a previously processed video
    YouTube = "/inet/tcp/0/www.youtube.com/80"
    Request = "GET /watch?v=" vid_id " HTTP/1.0\r\n\r\n"
    print Request |& YouTube
    while ((YouTube |& getline) > 0) {
        if (match($0, /"video_id": "([^"]+)".+"t": "([^"]+)"/, matches)) {
            # we found the request string
            INFO["request"] = "video_id=" matches[1] "&t=" matches[2]
        }
        else if (match($0, /<title>YouTube - ([^<]+)</, matches)) {
            # lets try to get the title of the video from html tag which is
            # less likely a subject to future design changes
            # (the other possibility is to get it from deeper html source:
            #  match($0, /<h1 id="video_title">([^<]+)</, matches))
            INFO["title"] = matches[1]
        }
    }
    close(YouTube)
}
#
# function download_video
#
# takes the request string and saves the movie to current directory using
# sanitized video title as filename
#
function download_video(req, title) {
    # sanitize the filename, replace all nonalnum chars with _, drop ()'s
    filename = title
    gsub(/\(|\)/, "", filename)
    gsub(/[^[:alnum:]]/, "_", filename)
    filename = filename ".flv"
    print "Saving video to " filename
    InetFile = "/inet/tcp/0/www.youtube.com/80"
    # I tried getting the video using HTTP/1.0 but it simply didnt work
    # so, HTTP/1.1 and the required Host header value
    Request = "GET /get_video?" req " HTTP/1.1\r\n"
    Request = Request "Host: www.youtube.com\r\n\r\n"
    Loop = 0 # make sure we do not get caught in Location: loop
    do { # we can get more than one redirect, follow them all
        delete HEADERS
        get_headers(InetFile, Request, HEADERS)
        if ("Location" in HEADERS) { # we got redirected, let's follow the link
            close(InetFile)
            parse_location(HEADERS["Location"], FOO)
            InetFile = FOO["InetFile"]
            Request = "GET " FOO["Request"] " HTTP/1.1\r\n"
            Request = Request "Host: " FOO["Host"] "\r\n\r\n"
            if (InetFile == "") {
                print "Downloading " title " failed, couldnt parse Location header"
                return
            }
        }
        Loop++
    } while (("Location" in HEADERS) && Loop < 5)
    if ("Location" in HEADERS) { # still being redirected after 5 requests
        print "Downloading " title " failed, got caught in Location loop"
        return
    }
    save_file(InetFile, filename)
    close(InetFile)
    print "Successfully downloaded " title
}
#
# function save_file
#
# given a special network file and filename reads from network until eof
# and saves the read contents into a file named filename
#
function save_file(Inet, filename) {
    OLD_RS = RS
    OLD_ORS = ORS
    ORS = ""
    # clear the file
    print "" > filename
    # here we will do a little hackery to write the downloaded data
    # to file chunk by chunk instead of downloading it all to memory
    # and then writing
    #
    # the idea is to use a regex for the record separator
    # everything that gets matched is stored in RT variable
    # which gets written to disk after each match
    #
    RS = ".{1,512}" # let's read 512 byte records
    while ((Inet |& getline) > 0)
        print RT >> filename
    close(filename) # flush the video to disk before moving on
    RS = OLD_RS
    ORS = OLD_ORS
}
#
# function get_headers
#
# given a special inet file and the request saves headers in HEADERS array
# special key "_status" can be used to find HTTP response code
# issuing another getline() on inet file would start returning the contents
#
function get_headers(Inet, Request, HEADERS) {
    # save global vars
    OLD_RS=RS
    print Request |& Inet
    # get the http status response
    if ((Inet |& getline) > 0) {
        HEADERS["_status"] = $2
    }
    else {
        print "Failed reading from the net. Quitting!"
        exit 1
    }
    RS="\r\n"
    while ((Inet |& getline) > 0) {
        # we could have used FS=": " to split, but i couldn't think of a good
        # way to handle header values which contain multiple ": "
        # so i better go with a match
        if (match($0, /([^:]+): (.+)/, matches)) {
            HEADERS[matches[1]] = matches[2]
        }
        else { break }
    }
    RS=OLD_RS
}
#
# function parse_location
#
# given a Location HTTP header value the function constructs a special
# inet file and the request storing them in FOO
#
function parse_location(location, FOO) {
    # location might look like http://cache.googlevideo.com/get_video?video_id=ID
    if (match(location, /http:\/\/([^\/]+)(\/.+)/, matches)) {
        FOO["InetFile"] = "/inet/tcp/0/" matches[1] "/80"
        FOO["Host"] = matches[1]
        FOO["Request"] = matches[2]
    }
    else {
        FOO["InetFile"] = ""
        FOO["Host"] = ""
        FOO["Request"] = ""
    }
}
Each function is well documented so the code should be easy to understand. If you see something that can be improved or optimized, just comment on this page. Also, if you would like me to explain each fragment of the source code in even more detail, let me know.
The most interesting function in this script is save_file, which does chunked downloading in a hacky way (see the comments in the source to see how).
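To see the trick in isolation, here is a small sketch that copies a local file in chunks the same way, with no networking involved. The scratch file names are made up for the demo and the chunk size is 4 instead of 512 so the effect is easy to follow. As with the full script, gawk versions that do not enable interval expressions by default need the --re-interval option.

```awk
# Sketch: copy a file chunk by chunk by (ab)using RS and RT.
# chunk_demo.in and chunk_demo.out are hypothetical scratch file names.
BEGIN {
    src = "chunk_demo.in"
    dst = "chunk_demo.out"
    ORS = ""                        # print must not add newlines of its own

    print "abc\ndefg\nh" > src      # a small test file, newlines included
    close(src)

    RS = ".{1,4}"                   # any run of 1 to 4 characters is a "separator"
    printf "" > dst                 # create or truncate the output file
    while ((getline < src) > 0) {
        chunks++
        print RT > dst              # the record itself is empty, the data is in RT
    }
    close(src)
    close(dst)
    print chunks " chunks written to " dst "\n"
}
```

Because the regex matches everything, each record is empty and the text that matched the "separator" lands in RT, so only one small chunk is held in memory at a time. Comparing the two files afterwards (for example with cmp) is a quick way to confirm nothing got lost.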
Download link: gawk youtube video downloader
August 9th, 2007 at 7:58 am
One of the best (or at least brightest) perl programers, Randal Schwartz (Merlyn) started as an AWK expert programer.
I love AWK, Bash and perl with the same intensity!
Keep on writing such interesting articles!
Alberto
August 9th, 2007 at 8:11 am
Thanks, Chanio! I will definitely keep writing interesting articles
August 29th, 2007 at 9:12 pm
VERY GOOD
September 4th, 2007 at 10:34 am
I am seeing another Torvalds in the making…anyone else also thinks the same way
Nice blog..
September 4th, 2007 at 4:44 pm
Credence, I am very thankful for your kind comparison!
October 19th, 2007 at 2:58 am
I have been to your site for just around 30 minutes and you have raised an immense interest in me to learn programming and the basics of internet. I not only see a great programmer in you, but also a great teacher. Hope, I will continue to learn from you and your blogs.
Btw, I already have a query. I’ve seen tools available to download just the audio from a youtube video, in various formats; but as per your explanation it seems, that the audio is integrated with the video in the .swf file. How can we extract only the audio part and have it converted to a format like mp3?
October 31st, 2007 at 5:53 pm
i like it?
February 21st, 2008 at 6:20 pm
Hi,
youtube changed it again. To download i made a little change in function get_vid_info
BR,
Werner.
February 22nd, 2008 at 11:23 pm
With regards to the new code Mr. Werner posted on February 21st, 2008 at 6:20 pm
I think a few lines of code for the function get_vid_info are missing starting with the line:
else if (match($0, /YouTube - ([^([^ filename
February 24th, 2008 at 5:27 pm
There are even more lines missing, sorry. I have no idea why…
However, just take the original code and update the first if statement in get_vid_info by the following:
if (match($0, /"video_id"[ \t]*:[ \t]*"([^"]+)".+"t"[ \t]*:[ \t]*"([^"]+)"/, matches)) {
BR,
Werner.
February 25th, 2008 at 8:39 pm
is there just a button that you click and it downloads that program for you? Then you can just download any video from youtube you want??? Please i need this…..
March 2nd, 2008 at 1:09 am
Elmer Fittery and Werner, I have uploaded a new version of gawk youtube downloader. Thanks for noticing the broken version!
March 16th, 2008 at 5:50 am
I wrote a video downloader with a friend in ruby. Youtube is also handled (18+ urls also). You can check the source for comparision, it’s hosted on code.google.com/p/mget
The project site: http://movie-get.org
There are *several* other hosting sites we support, check it out
March 16th, 2008 at 2:04 pm
That’s pretty cool, nice job. I was a little disappointed, though, that you put all the logic in the BEGIN pattern. It seems like you do all the pattern matching in a procedural manner inside the functions themselves rather than using the pattern/action syntax which is a bit more natural for AWK programs.
April 22nd, 2008 at 1:25 am
holy crap this is cool!
this is the only
simple youtube dl’er i’ve found that gives you a human readable filename!
that feature alone makes this one very cool.
the other thing that’s nice is gawk is a small package without so many possible config snafus.
python, perl, ruby, etc. are quite large and there’s more possibility of some config problem rearing up.
a python dl’er is my 2d fav to this one. but the filename thing is still an annoyance with the python solution i was using.