We’ve added a new copy-and-paste scraping guide, so you can quickly get the lines of code you need to parse an HTML file using CSS selectors. You can get to it from the documentation page.
The HTML parsing guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose the language at the top right of the page.
While the library used varies (lxml in Python, Nokogiri in Ruby, Simple HTML DOM in PHP), the principle is the same. You pull text out of the page with the same CSS selectors you would use to style it.
It’s a popular technique – for example, around 30% of Python scrapers on ScraperWiki use lxml.