Friday, May 14, 2010

Python Web Scraping Tools

A note to myself, listing Python tools that look interesting for my next web scraping project:

  • urllib2: extensible library for opening URLs.
  • PyQuery: jQuery-like traversing and selecting for Python.
  • mechanize: stateful programmatic web browsing in Python.
  • Beautiful Soup: HTML/XML parser that tolerates broken markup. It does not seem to be maintained much anymore, and the latest versions are rather slow and buggy.
  • Scrapy: looks nice; it also handles fetching the URLs itself, with cookie support and such.
  • lxml.html: the HTML-handling package of lxml, a Pythonic binding for the libxml2 and libxslt libraries.

Probably going with Scrapy.
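
For reference, whichever tool wins, the core job is the same: fetch a page, parse it, pull out the interesting elements. Below is a minimal standard-library sketch of that, assuming Python 3 (where urllib2's URL opening lives in urllib.request) and using html.parser as a stand-in for the parsers listed above. The names LinkCollector, extract_links and fetch are made up for illustration and do not come from any of these libraries.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for all links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, timeout=10):
    """Download a page and decode it (charset from the headers, else UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Typical use would be something like extract_links(fetch(url), url). This sketch has no cookies, no retries and no crawl scheduling, which is the part a framework like Scrapy is supposed to take care of.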