Scraping guides: Values, separated by commas

When we revamped our documentation a while ago, we promised guides to specific scraper libraries, such as lxml, Nokogiri and so on.

We’re now staring to roll those out. The first one is simple, but a good one. Go to the documentation page and you’ll find a new section called “scraping guides”.

The CSV scraping guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which at the top right of the page.

“CSV” stands for “comma separated value”. It’s a basic but quite common format for publishing spreadsheet files. Take a look at the scrapers tagged CSV on ScraperWiki. Lots of ministerial meetings and government spending records.

The CSV scraping guide shows you how to download and parse a CSV file, and how to easily save it into the datastore.

Why write a scraper just to load a CSV file in? First there are quirks you’ll inevitably find – inconsistencies in fields, extra rows and columns, dates that you want formatting and so on. Secondly, you might want to load in multiple files and merge them together. Finally, you get the data in ScraperWiki, giving you a JSON API and letting you make views.

You can do quite funky things with even just CSV scraping. For example, see Nicola’s @scraper_no10 Twitter bot.

Next time, Excel files.

This entry was posted in developer and tagged , , . Bookmark the permalink.

One Response to Scraping guides: Values, separated by commas

  1. Pingback: Huge list of data journalism resources | clairemiller.net

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s