Scraping guides: Excel spreadsheets

Following on from the CSV scraping guide, we’ve now added one about scraping Excel spreadsheets. You can get to both guides from the documentation page.

The Excel scraping guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which language at the top right of the page.

As with CSV files, at first it seems odd to be scraping Excel spreadsheets, when they’re already at least semi-structured data. Why would you do it?

The format of Excel files can vary a lot – how columns are arranged, where tables appear, what worksheets there are. There can be errors and inconsistencies that are easiest to fix in code. Sometimes you’ll find the data is there but not formatted in cells – entire rows in one cell, or data stored in notes.
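To give a flavour of what fixing this in code can look like, here is a minimal Python sketch. It is not taken from the guide. It assumes the worksheet has already been read into a list of rows (each row a list of cell values, which is roughly what a spreadsheet library hands back), and the function names and sample data are made up for illustration. It finds the header row wherever the table starts, skips blank rows, and splits a row that has been crammed into a single cell.

```python
def find_header_row(rows, expected):
    """Return the index of the first row that contains every expected column name."""
    wanted = {name.lower() for name in expected}
    for index, row in enumerate(rows):
        cells = {str(cell).strip().lower() for cell in row if cell not in (None, "")}
        if wanted <= cells:
            return index
    raise ValueError("header row not found")


def extract_records(rows, expected):
    """Yield one dict per data row below the header."""
    start = find_header_row(rows, expected)
    # Remember the position and name of every non-empty header cell.
    columns = [
        (i, str(cell).strip().lower())
        for i, cell in enumerate(rows[start])
        if cell not in (None, "")
    ]
    for row in rows[start + 1:]:
        filled = [cell for cell in row if cell not in (None, "")]
        if not filled:
            continue  # blank spacer row
        if len(filled) == 1 and isinstance(filled[0], str) and "," in filled[0]:
            # Whole row typed into one cell: split on commas (an assumption
            # about the delimiter; values stay as strings).
            parts = [part.strip() for part in filled[0].split(",")]
            yield {name: value for (_, name), value in zip(columns, parts)}
        else:
            yield {name: (row[i] if i < len(row) else None) for i, name in columns}


# Hypothetical worksheet contents: a title, a blank row, then the real table.
SAMPLE_ROWS = [
    ["Sites register", None, None],
    [None, None, None],
    ["Site", "Area (ha)", "Region"],
    ["Old Mill", 2.5, "North"],
    ["Dock Road, 1.2, East", None, None],
    [None, None, None],
]

records = list(extract_records(SAMPLE_ROWS, ["site", "region"]))
```

Real spreadsheets will need their own rules, for instance a different delimiter or a header spread over two rows, but the shape is usually the same: locate the table, then normalise each row.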

We used an Excel scraper that pulls together 9 spreadsheets into one dataset for the brownfield sites map used by Channel 4 News.

Dave Hughes has one that converts a spreadsheet from an FOI request, making a nice dataset of temperatures in Cambridge’s botanical garden.

This merchant oil shipping scraper does a few regular expressions to parse some text in one of the columns.
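As a rough illustration of that kind of parsing, here is a short Python sketch. The cell text, the pattern and the `parse_cargo` name are all hypothetical; they are not taken from the scraper mentioned above, whose column format may be quite different.

```python
import re

# Hypothetical format: a quantity, a unit, then a product name,
# e.g. "45,000 tonnes crude oil".
CARGO_PATTERN = re.compile(
    r"(?P<quantity>\d[\d,]*)\s*(?P<unit>tonnes|barrels)\s+"
    r"(?P<product>[A-Za-z]+(?: [A-Za-z]+)*)",
    re.IGNORECASE,
)


def parse_cargo(text):
    """Pull quantity, unit and product out of a free-text cell, or return None."""
    match = CARGO_PATTERN.search(text)
    if match is None:
        return None
    return {
        # Strip thousands separators before converting to a number.
        "quantity": int(match.group("quantity").replace(",", "")),
        "unit": match.group("unit").lower(),
        "product": match.group("product").lower(),
    }


cargo = parse_cargo("Loaded 45,000 tonnes crude oil")
```

Returning `None` for cells that do not match makes it easy to log the rows that need a closer look rather than silently dropping them.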

Next time – parsing HTML with CSS selectors.
