A very long time ago I discovered the easiest webscraping target: the locations of all the North Sea Oil wells.
Once you crawl through the index pages, the entries are pretty straightforward. There are dates, water depths (in feet or metres), GPS locations and so on. The code, if you want to look at it, is here.
Now, you could do the obvious thing and build a mash-up of all 10,960 well holes (which I had for a while), which people can admire. But then what?
But to get each data point of the activity in front of many eyes, I made a Twitter bot that tweets every time a new oil well is drilled.
Sounded straightforward. But nothing is ever so straightforward, especially when the data itself gives you other ideas.
For those of you who don’t want to read all the way to the bottom, I give you @NorthSeaOil1.
If you look at the image on the right you’ll see that there are several dates in the record: the SPUD date (when the drill first entered the ground), the Date TD Reached (when the total depth of the well was reached), and the Completion Date (when the drilling crew finished whatever they were doing with the borehole, such as installing pipes or taking samples, before leaving the site).
I wanted a tweet for each of these events as they happened.
This requirement means the scraper can’t just be the simple kind that goes through and takes a copy of their database. It needs to keep track of what the pages said before each scrape so it knows when they change and what changed.
I chose to keep a copy of each record whenever there was a change, as well as record the time it was entered into the scraperwiki database. That way I could select all the records for one particular Well Registration Number, sort by the scrape date and be able to compare the last two records when generating the tweet.
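The diffing step above can be sketched roughly like this. It's a minimal illustration of the idea, not the actual scraper's code, and the field and function names are my own invention:

```python
# Minimal sketch of the change-detection idea: keep every version of a
# record, then compare the two most recent versions for a given well.
# Field names (well_number, scrape_date, etc.) are illustrative only.

def detect_changes(records, well_number):
    """Return (field, old, new) tuples for the latest change to a well."""
    # All saved versions of this well, oldest scrape first
    versions = sorted(
        (r for r in records if r["well_number"] == well_number),
        key=lambda r: r["scrape_date"],
    )
    if len(versions) < 2:
        return []  # nothing to compare yet
    old, new = versions[-2], versions[-1]
    return [
        (field, old.get(field), new.get(field))
        for field in ("spud_date", "td_date", "completion_date")
        if old.get(field) != new.get(field)
    ]

# Two scrapes of the same (made-up) well: the second adds a TD date
records = [
    {"well_number": "44/21-1", "scrape_date": "2011-01-01",
     "spud_date": "2010-12-20", "td_date": None, "completion_date": None},
    {"well_number": "44/21-1", "scrape_date": "2011-02-01",
     "spud_date": "2010-12-20", "td_date": "2011-01-15", "completion_date": None},
]
changes = detect_changes(records, "44/21-1")
```

Each tuple in `changes` is an event worth tweeting about, and because every old version stays in the database you can always re-run the comparison later.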
If you want to go straight to my code, it’s here. If you want to know how I published the code without giving away the secret twitter access keys, read on.
The tweeting function uses a Python module called tweepy. (If you find similar modules in Ruby or PHP that you would like to use, please tell us about them.)
Its instruction pages walk you through setting up a Twitter Application for your bot that connects to your account, as well as getting your Consumer key, Consumer secret, Access token and Access token secret. (Don’t forget to enable it for Read and write.)
Now, normally you’d put these values into your code, but then everyone would see them. So I’ve devised a little hack so you can put them into the description of your scraper in the following form:
__BEGIN_QSENVVARS__
CONSUMER_KEY = XXXXXX
CONSUMER_SECRET = XXXXXXXXXXXX
ACCESS_KEY = XXXXXX-XXXXXXX
ACCESS_SECRET = XXXXXXXXXXXXXXX
__END_QSENVVARS__
Try it to make sure that they are masked when you are not editing the description.
These values are passed in through the query_string, and you are able to access them like so:
import os, cgi
import tweepy

qsenv = dict(cgi.parse_qsl(os.getenv("QUERY_STRING")))
auth = tweepy.OAuthHandler(qsenv["CONSUMER_KEY"], qsenv["CONSUMER_SECRET"])
auth.set_access_token(qsenv["ACCESS_KEY"], qsenv["ACCESS_SECRET"])
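If you want to convince yourself the query-string trick works, here's a self-contained check with made-up key values. (I use `urllib.parse.parse_qsl`, which is the same parser `cgi.parse_qsl` wraps, so the behaviour is identical.)

```python
import os
from urllib.parse import parse_qsl

# Simulate the query string the platform would pass in (values are fake)
os.environ["QUERY_STRING"] = (
    "CONSUMER_KEY=abc123&CONSUMER_SECRET=def456"
    "&ACCESS_KEY=tok-789&ACCESS_SECRET=sec000"
)

# Same pattern as the snippet above: parse into a plain dictionary
qsenv = dict(parse_qsl(os.getenv("QUERY_STRING")))
```

You end up with an ordinary dict of your four secrets, ready to hand to tweepy, and none of them ever appear in your published code.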
Don’t forget to set your tweet-bot scraper to “Protected” so people can’t edit the code and print out your secret tokens!
Then comes the complicated task of making your information read like English in 140 characters or fewer.
You can read all the tricks I’ve used such as turning a block number into a readable location by inspecting this map (line 102), not stating the date twice when the completion date and total depth date are the same (line 89), and inserting the word “now” into the sentence when there is a change of completion status (line 98).
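To give a flavour of that sentence-building, here's a toy version of one trick: merging the two dates into a single phrase when they coincide, then trimming to the 140-character limit. This is my own simplification, not the line-numbered code above:

```python
def build_tweet(well, td_date, completion_date):
    """Toy tweet builder: merge identical dates, cap at 140 characters."""
    if td_date == completion_date:
        # Don't state the same date twice
        text = "Well %s reached total depth and was completed on %s." % (
            well, td_date)
    else:
        text = "Well %s reached total depth on %s, completed %s." % (
            well, td_date, completion_date)
    return text[:140]  # Twitter's limit at the time

tweet = build_tweet("44/21-1", "2011-01-15", "2011-01-15")
```

The real bot layers several of these small rules on top of each other, which is what makes the feed read like a person wrote it rather than a database dump.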
There’s always more to do to make it better. It takes a while to hone your data for tweeting.
But if it’s a good one maybe a lot of people will follow it and find it interesting.
So watch out North Sea Oil Wells – you’ve just been ScraperWikied!