The primary duty of an exception handler is to get the error out of the lap of the programmer and into the surprised face of the user. Provided you keep this cardinal rule in mind, you can't go far wrong.
The code was just a snippet of some other code I have, but in regards to your points:
1. There are only 3 points where errors can creep in that I see:
1- if the urlopen fails or
2- during the htmlparser() if the html is super-malformed (same w/ BeautifulSoup).
3- if google changes their html format(but that will screw up almost any scraper)
The xpath and rest of the code will be work without problem since xpath will return '[]' if the xpath fails.
4. If you change the line:
tree=et.fromstring(results,et.HTMLParser())
to:
tree=et.fromstring(results,et.HTMLParser(recover=True))
then lxml handles malformed html almost as well as BeautifulSoup.
Anyways, I enjoy your blog, and just thought that I'd throw that out there.
I am being sponsored by Syntress! They bought me an amazing dedicated server to run catonmat on. If you're looking web services, I highly recommend the Syntress guys!
Education Sponsors
I am being sponsored by A-Writer! If you ever need help with essay writing, look no further than A-Writer! They will help you with your writing in as quickly as 3 hours!
The code was just a snippet of some other code I have, but in regards to your points:
1. There are only 3 points where errors can creep in that I see:
1- if the urlopen fails or
2- during the htmlparser() if the html is super-malformed (same w/ BeautifulSoup).
3- if google changes their html format(but that will screw up almost any scraper)
The xpath and rest of the code will be work without problem since xpath will return '[]' if the xpath fails.
4. If you change the line:
tree=et.fromstring(results,et.HTMLParser())
to:
tree=et.fromstring(results,et.HTMLParser(recover=True))
then lxml handles malformed html almost as well as BeautifulSoup.
Anyways, I enjoy your blog, and just thought that I'd throw that out there.
Reply To This Comment