Journalists, academics and budding open data hackers often praise ScraperWiki for making web scraping easy. And while it’s true our platform and powerful APIs let you get more done, more easily, the statement still creates some head-scratching at ScraperWiki HQ.
That’s because, as far as we can tell, scraping is hard, no matter what platform you’re using.
For example, let’s pretend you’re scraping a fairly ordinary web page that has some data as a table. Barely a sentence in and we already need to know about HTML and URLs. We need to access this page programmatically, so we need to pick a language to write a scraper in. Say Python. How do we select the elements we need from the table? A CSS selector. The header is blue, so how do we detect the colour of an element? RGB hex-triples…
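To make that list of prerequisites concrete, here is a minimal sketch of such a scraper. It is only an illustration under stated assumptions: the HTML snippet, the `TableParser` class and the `hex_to_rgb` helper are all made up for this example, the page is an inline string rather than something fetched from a URL, and it uses Python's standard-library `html.parser` instead of a CSS selector library so that it runs without installing anything.

```python
import re
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would first fetch this from a URL.
SAMPLE_HTML = """
<table id="prices">
  <tr style="background-color: #0000ff"><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>0.50</td></tr>
  <tr><td>Pear</td><td>0.75</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every th/td cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self.row_styles = []  # the raw style attribute of each <tr>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self.row_styles.append(dict(attrs).get("style") or "")
        elif tag in ("th", "td") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._in_cell:
            self._in_cell = False
            self.rows[-1][-1] = self.rows[-1][-1].strip()

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def hex_to_rgb(hex_triple):
    """Convert an RGB hex triple such as '#0000ff' to a tuple of integers."""
    value = hex_triple.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def scrape(html):
    """Parse the HTML and return (rows, row_styles)."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows, parser.row_styles


rows, styles = scrape(SAMPLE_HTML)

# Detect the header colour by looking for a hex triple in the row's style.
match = re.search(r"#[0-9a-fA-F]{6}", styles[0])
header_rgb = hex_to_rgb(match.group(0)) if match else None

print(rows)
print(header_rgb)
```

Even this toy version leans on HTML structure, inline CSS, regular expressions and hexadecimal notation, and it would still need HTTP handling before it could touch a real page.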
A little more thinking along these lines leads to something like this diagram:
If you want to scrape the web, you need to know all that. Admittedly, you don’t need to be an expert (not for most scraping tasks), but you do need to know at least a little about lots of things before you can even begin to get something useful out of a web page.