You're viewing a comment by chad and its responses.

May 05, 2009, 01:32

The code was just a snippet from some other code I have, but regarding your points:

1. I see only three places where errors can creep in:
1- if urlopen fails,
2- during HTMLParser(), if the HTML is badly malformed (the same goes for BeautifulSoup), or
3- if Google changes its HTML format (but that will break almost any scraper).

The xpath call and the rest of the code will work without problems, since xpath() returns an empty list ('[]') when the expression matches nothing.
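
A minimal sketch of where those failure points sit, assuming Python 3 and lxml. The helper names fetch and extract_hrefs and the default query are hypothetical and not part of the original snippet:

```python
from urllib.request import urlopen

from lxml import etree as et


def fetch(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return None


def extract_hrefs(html, query="//a/@href"):
    """Return the XPath matches, or an empty list if nothing can be parsed."""
    if not html:
        return []
    try:
        tree = et.fromstring(html, et.HTMLParser())
    except et.XMLSyntaxError:
        # Raised when the parser cannot build a document at all.
        return []
    if tree is None:
        return []
    # xpath() returns [] when the expression matches nothing,
    # so the caller only ever has to deal with a list.
    return tree.xpath(query)
```

With this shape, extract_hrefs(fetch(url)) gives back a list in every case, so a changed page layout shows up as an empty result rather than an exception.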

4. If you change the line:
tree=et.fromstring(results,et.HTMLParser())
to:
tree=et.fromstring(results,et.HTMLParser(recover=True))
then lxml handles malformed HTML almost as well as BeautifulSoup.
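
As a rough illustration, assuming Python 3 and lxml, here is the same call on a deliberately broken fragment. The sample markup and variable names are made up for this example:

```python
from lxml import etree as et

# Unclosed <p>, <b> and <a> tags, and no closing </body> or </html>.
broken = "<html><body><p>first<p>second <b>bold</p><a href='/next'>next"

# recover=True asks the parser to repair what it can instead of giving up.
tree = et.fromstring(broken, et.HTMLParser(recover=True))

links = tree.xpath("//a/@href")
text = "".join(tree.itertext())
```

As far as I know, recent lxml releases already default to recover=True for HTMLParser, so passing it explicitly mostly documents the intent.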

Anyway, I enjoy your blog and just thought I'd throw that out there.
