Hi, I’m Sean

My name is Sean Duffy, and I’m an intern at ScraperWiki! I’ve just finished sixth form and in around a month’s time I’ll be starting my degree in Computer Systems and Software Engineering at the University of York. As well as programming, in my spare time I also enjoy electronics, reading and building/flying quadcopters.

When I heard about ScraperWiki on a Hacker News post earlier in the summer, I knew I couldn’t pass up the opportunity to inquire about an internship and now, in my fourth week, I’m glad that I did! In the past I’ve struggled to find relevant work placements close enough to where I live, so it was great to find such an interesting company only a short train journey away.

In my time here I’ve been coding in Python and JavaScript, and have learnt a lot more than I previously knew about Git, jQuery and regular expressions; like Matthew, I have adopted Vim as my go-to text editor. I’ve also enjoyed participating in things like pair programming and retrospectives, which form part of the Extreme Programming methodology utilised at ScraperWiki. Aside from that, I’ve met some great people and all of the team have been welcoming and friendly from day one. Another bonus is that the choice of eateries at lunch is always top-quality!

I’m very happy to have been given the privilege of an opportunity as good as this and I know I’ve already taken away a great deal of useful knowledge, skills and experience. I’m sure that the things I’ve learnt at ScraperWiki will serve me well during my degree and the rest of my career. I’ve enjoyed it here so much that it’s made me even more optimistic and excited about a career in software!

That just about concludes my introductory blog post, but if you’d like to read about what I’ve been up to you can visit my blog, and if you’d like to contact me then feel free to email me!

Scrape anyone’s Twitter followers

Words are important, people more so.

Following our popular tool which makes it easy to scrape and download tweets, we’re pleased to announce a new one to get any Twitter account’s followers.

To use it, log into ScraperWiki, choose “Create a new dataset”, then pick the tool:

'Get Twitter followers' tool

Then enter the name of the user you want (with or without the @).

Twitter followers interface

If they have a lot of followers, it will take a few hours. (And it’ll keep updating after that, so the data is always fresh.)

Wait for followers

Meanwhile you can use other tools to view the data, download it to Excel and more.

Other tools

For example, using the “Summarise this data” tool, this is what the people who follow @samuelpepys are into, according to their Twitter bios.

Summarise Twitter bios

Or, using the “View in a table” tool, here are my followers who mention Python in their bios, sorted by popularity. (Click for a big version)

Popularity

I did exactly this a few months ago to publicise a job at ScraperWiki. I found all of the people who followed me, sorted them by follower count and filtered them down to those interested in Python. I immediately had an easy way to reach the most receptive audience for the job.
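
If you’d rather work with the data programmatically, you can also query the dataset’s SQL endpoint directly. Here’s a rough Python sketch of the same filter-and-sort; the endpoint URL, table name and column names below are placeholders for whatever your own dataset uses:

# A rough sketch: ask a dataset's SQL endpoint for followers whose bio mentions
# Python, sorted by follower count. The URL, table and column names are
# placeholders - substitute the details of your own dataset.
import requests

SQL_ENDPOINT = "https://example.scraperwiki.com/yourbox/sql"  # placeholder URL

query = """
    SELECT screen_name, bio, followers_count
    FROM twitter_followers
    WHERE bio LIKE '%python%'
    ORDER BY followers_count DESC
    LIMIT 20
"""

rows = requests.get(SQL_ENDPOINT, params={"q": query}).json()
for row in rows:
    print("%s (%s followers)" % (row["screen_name"], row["followers_count"]))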

Let us know how you use data about Twitter people!

It’s good to share…

Image by Jason Empey

As you may have gathered, I’m on a journey: I’ve worked as a physicist and a data scientist for 20 years, and now I’ve fallen amongst software engineers. There are obvious similarities in what we do – we write code to do stuff. I write code to analyse things; the software engineers write code to do things for other people.

But the practice of these two disciplines can be quite different. I’ve written about my introduction to practical testing, and I’ve alluded to pair programming. Pair programming is where two programmers work together, side by side, on the same piece of code. Nominally the “Driver” sits at the keyboard while the “Navigator” thinks, researches and directs. In practice you talk about what you’re doing as you go, and as a consequence you hopefully produce better code; at the very least you produce code with which two people are familiar.

It’s a strangely sociable activity which I’ve found very educational, because sometimes the small practice of how you do things is as important as the big theoretical picture. I, for example, can now passably use the Vim text editor. And, as discussed before, I’m now a fan of testing.

Pair programming is one facet of sharing code, and I’ve shared my code in the past. My PhD thesis, published 20 years ago, has the FORTRAN programs I used to analyse my data printed in the back. I happily shared code with my colleagues at Unilever, but this was a sham: I was pretty confident no one would read my code, and they wouldn’t build on my code; the most they would do was run it.

Now things are different: I have written open source code which my colleagues are using, and which complete strangers could potentially use. You can see it here. I’ve sat down with people who are actually going to use and extend my code; in fact, it is no longer mine.

This has several effects. Firstly, there can be changes I don’t necessarily understand. Secondly, style and format become more important. To use an analogy: you could publish a newspaper as one long strip of paper, in a single typeface and weight, with articles presented in alphabetical order of their first word. But it would be difficult to read and navigate, because we are used to the ideas of headlines and bylines, and the convention of major news at the front and sport at the back.

And so it is with code: programming languages have coding conventions which aren’t part of the language itself, but which are important when you share code with others. In Python the coding standard is called PEP8; it tells you how to name your functions and how to lay out your code. Different languages have different conventions, and writing code in the wrong convention is like speaking with a foreign accent.
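
For instance, here’s a minimal sketch of the same (made-up) function written first in the PEP8 style – lowercase_with_underscores names, four-space indents, spaces around operators – and then with a “foreign accent”:

# PEP8-ish: lowercase_with_underscores names, four-space indents, spaces around operators
def mean_monthly_rainfall(readings):
    return sum(readings) / len(readings)

# The same code with a foreign accent - it runs, but it reads oddly to a Python programmer
def MeanMonthlyRainfall( Readings ):
    return sum(Readings)/len(Readings)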

Thirdly, I feel more responsible; I was distressed to discover my colleague relying on a function that I considered to be obsolete, one I had written in the early stages of development but since left to languish. But I had left no evidence that this was the case. My fragmentary, orphaned comments are similarly unhelpful to another programmer.

But it’s good to share, “my” code has benefitted from other people’s insights, and I hope I’m writing better code in the knowledge that other people are using it.

Hi, I’m Steve

Hi, I’m Steve and I’m the most recent addition to ScraperWiki’s burgeoning intern ranks. So, how exactly did I end up here?

Looking at ScraperWiki’s team page, you can see that scientists are a common theme here. I’m no different in that regard. Until recently, I was working as a university scientific researcher (looking at new biomedical materials).

As much as I’ve enjoyed that work, I began to wonder what other problems I could tackle with my scientific training. I’ve always had a big interest in technology. And, thanks to the advent of free online courses from the likes of edX and Coursera, I’ve recently become more involved with programming. When I heard about data science a few months ago, it seemed like it might be an ideal career for me, using skills from both of these fields.

Having written a web scraper myself to help in my job searching, I had some idea of what that involves. I’d also previously seen ScraperWiki’s site while reading about scrapers. When I heard that ScraperWiki were advertising for a data science intern, I knew it would be a great chance to gain a greater insight into what this work entails.

Since I didn’t have any prior notions of what working in a technology company or a startup involves, I’m pleased that it’s been so enjoyable. To an outsider coming in, there are many positive aspects of how the company works:

ScraperWiki is small (but perfectly formed): the fact that everyone is based in the same office makes it easy to ask a question directly of the most relevant person. Even when people are working remotely, they are in contact via the company’s Internet Relay Chat channel or through Google Hangouts. This also means that I’m seeing both sides of the company: what the Data Services team do, and the ongoing work to constantly improve the platform.

Help’s on hand: having knowledgeable and experienced people around in the office is a huge benefit when I encounter a technical problem, even if it’s not related to ScraperWiki. When I’m struggling to find a solution myself, I can always ask and get a quick response.

There’s lots of collaboration: pair programming is a great way to pick up new skills. Rather than struggling to get started with, say, some new module or approach, you can see someone else start working with it and pick up tips to push you past the initial inertia of trying something new.

And there’s independence too: as well as working with others on what they are doing and trying to help where I can, I’ve also been given some small projects of my own. Even in the short time I’m here, I should be able to construct some useful tools that might be made publicly available via ScraperWiki’s platform.

(Oh, I shouldn’t miss out refreshments: as Matthew, another intern, recently pointed out, lunch often involves a fun outing to one of Liverpool’s many fine eateries. As well as that, tea is a regular office staple.)

It’s definitely been an interesting couple of weeks for me here. You can usually see what I’m up to via Twitter or my own blog. Over the next few weeks, I’m looking forward to writing here again about what I’ve been working on.

Mastering space and time with jQuery deferreds

A screenshot of Rod Taylor enthusiastically grabbing a lever on his Time Machine in the 1960 film of the same name

Recently Zarino and I were pairing on making improvements to a new scraping tool on ScraperWiki. We were working on some code that allows the person using the tool to pick out parts of some scraped data in order to extract a date into a new database column. For processing the data on the server side we were using a little helper library called scrumble, which does some cleaning in Python to produce dates in a standard format. That’s great for the server side, but we also needed to display a preview of the cleaned dates to the user before the data is finally sent to the server for processing.

Rather than rewrite this Python code in JavaScript we thought we’d make a little script which could be called using the ScraperWiki exec endpoint to do the conversion for us on the server side.
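
The real cleaning is done by the scrumble helper, but to give a flavour, a minimal sketch of such a server-side script might look like the following – python-dateutil here is purely a stand-in assumption for what scrumble actually does:

#!/usr/bin/env python
# Hypothetical sketch of a do-scrumble.py-style script: read a JSON-encoded
# date string from the command line, clean it, and print the result as JSON.
# The real tool uses ScraperWiki's scrumble library; dateutil is a stand-in.
import json
import sys

from dateutil import parser

raw = json.loads(sys.argv[1])                    # e.g. '"3rd Aug 2013"' -> '3rd Aug 2013'
cleaned = parser.parse(raw).date().isoformat()   # -> '2013-08-03'
print(json.dumps(cleaned))                       # the client JSON.parses this response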

Our code looked something like this:

var $tr = $('<tr>');

// for each cell in each row…
$.each(row, function (index, value) {
  var $td = $('<td>');  // 'var' keeps each cell local to this iteration
  var date = scraperwiki.shellEscape(JSON.stringify(value));
  // execute this command on the server…
  scraperwiki.exec('tools/do-scrumble.py ' + date, function(response){
    // and put the result into this table cell…
    $td.html(JSON.parse(response));
  });
  $td.appendTo($tr);
});

Each time we needed to process a date with scrumble we made a call to our server side Python script via the exec endpoint. When the value comes back from the server, the callback function sets the content of the table cell to the value.

However, when we started testing our code we hit a limit placed on the exec endpoint to prevent overloading the server (currently no more than 5 exec calls can be executing at once).

Our first thought was to just limit the rate at which we made requests so that we didn’t trip the rate limit, but our colleague Pete suggested we should think about batching up the requests to make them faster. Sending each one individually might work well with just a few requests, but what about when we needed to make hundreds or thousands of requests at a time?

How could we change it so that the conversion requests were batched, and the results were inserted into the right table cells once they’d been computed?

jQuery.deferred() to the rescue

We realised that we could use jQuery deferreds to allow us to do the batching. A deferred is like an I.O.U that says that at some point in the future a result will become available. Anybody who’s used jQuery to make an AJAX request will have used a deferred – you send off a request, and specify some callbacks to be executed when the request eventually succeeds or fails.

By returning a deferred we could delay the call to the server until all of the values to be converted have been collected and then make a single call to the server to convert them all.

Below is the code which does the batching:

scrumble = {
  deferreds: {},

  // return a promise for the cleaned version of raw_date, creating a
  // deferred for this string if we haven't seen it before
  as_date: function (raw_date) {
    if (!this.deferreds[raw_date]) {
      var d = $.Deferred();
      this.deferreds[raw_date] = d;
    }
    return this.deferreds[raw_date].promise();
  },

  // send all of the collected dates to the server in one batch, then
  // resolve each deferred with its cleaned value
  process_dates: function () {
    var self = this;
    var raw_dates = _.keys(self.deferreds);
    var date_list = scraperwiki.shellEscape(JSON.stringify(raw_dates));
    var command = 'tool/do-scrumble-batch.py ' + date_list;
    scraperwiki.exec(command, function (response) {
      var response_object = JSON.parse(response);
      $.each(response_object, function (key, value) {
        self.deferreds[key].resolve(value);
      });
    });
  }
}

Each time as_date is called it creates or reuses a deferred which is stored in an object keyed on the raw_date string and then returns a promise (a deferred with a restricted interface) to the caller. The caller attaches a callback to the promise that will use the value once it is available.

To actually send the batch of dates off to be converted, we call the process_dates method. It makes a call to the server with all of the strings to be processed. When the result comes back from the server it “resolves” each of the deferreds with the processed value, which causes all of the callbacks to fire updating the user interface.

With this design the changes we had to make to our code were minimal. It was already using a callback to set the value of the table cell. It was just a case of attaching it to the jQuery promise returned by the scrumble.as_date method and calling scrumble.process_dates, after all of the items had been added, to make the server side call to convert all of the dates.

var $tr = $('<tr>');

$.each(row, function (index, value) {
  var $td = $('<td>');
  var date = scraperwiki.shellEscape(JSON.stringify(value));
  // attach a callback to the promise; it fires once process_dates()
  // has resolved the underlying deferred
  scrumble.as_date(date).done(function(response){
    $td.html(JSON.parse(response));
  });
  $td.appendTo($tr);
});

scrumble.process_dates();

Now instead of one call being made for every value that needs converting (whether or not that string has already been processed) a single call is made to convert all of the values at once. When the response comes back from the server, the promises are resolved and the user interface updates showing the user the preview as required. jQuery deferreds allowed us to make this change with minimal disruption to our existing code.

And it gets better…

Further optimisation (not shown here) is possible if process_dates is called multiple times. A little-known feature of jQuery deferreds is that they can only be resolved once. If you make an AJAX call like $.get('http://foo').done(myCallback) and then, some time later, call .done(myCallback) on that ajax response again, the callback myCallback is immediately called with the exact same arguments as before. It’s like magic.
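
The same idea exists outside jQuery; purely as an analogy, here’s the equivalent behaviour sketched with a Python concurrent.futures Future, which can likewise only be resolved once and fires late-attached callbacks immediately:

# Analogy only (our actual code uses jQuery deferreds): a Future can be
# resolved exactly once, and callbacks attached after resolution fire
# immediately with the stored result.
from concurrent.futures import Future

f = Future()
f.set_result("2013-08-03")      # resolve once; a second set_result would raise

f.add_done_callback(lambda fut: print("first: " + fut.result()))   # fires immediately
f.add_done_callback(lambda fut: print("second: " + fut.result()))  # so does this one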

We realised we could turn this quirky feature to our advantage. Rather than checking whether we’d already converted a date and returning the pre-converted value ourselves on subsequent calls, we just attach the .done() callback regardless, as if it were the first time: deferreds that have already been resolved fire their callbacks immediately. That means we only need to send the server the new dates that haven’t been processed yet.

jQuery deferreds helped us keep our user interface responsive, our network traffic low, and our code refreshingly simple. Not bad for a mysterious set of functions hidden halfway down the docs.

Book review: The Tableau 8.0 Training Manual – From clutter to clarity by Larry Keller

Tableau 8.0 Training Manual

My unstoppable reading continues, this time I’ve polished off The Tableau 8.0 Training Manual: From Clutter to Clarity by Larry Keller. This post is part review of the book, and part review of Tableau.

Tableau is a data visualisation application which grew out of academic research on visualising databases. I’ve used Tableau Public a little bit in the past. Tableau Public is a free version of Tableau which only supports public data – great for playing around with, but not so good for commercial work. Tableau is an important tool in the business intelligence area, useful for getting a quick view on data in databases, and something our customers use, so we are interested in providing Tableau integration with the ScraperWiki platform.

The user interface for Tableau is moderately complex, hence my desire for a little directed learning. Tableau has a pretty good set of training videos and help pages online, but these are no good to me since I do a lot of my reading on my commute, where internet connectivity is poor.

Tableau is rather different to the plotting packages I’m used to using for data analysis. This comes back to the types of data I’m familiar with. As someone with a background in physical sciences I’m used to dealing with data which comprises a couple of vectors of continuous variables. So for example, if I’m doing spectroscopy then I’d expect to get a pair of vectors: the wavelength of light and the measured intensity of light at those wavelengths. Things do get more complicated than this, if I were doing a scattering experiment then I’d get an intensity and a direction (or possibly two directions). However, fundamentally the data is relatively straightforward.

Tableau is crafted to look at mixtures of continuous and categorical data, stored in a database table. Tableau comes with some sample datasets, one of which is sales data from superstores across the US which illustrates this well. This dataset has line entries of individual items sold with sale location data, product and customer (categorical) data alongside cost and profit (continuous) data. It is possible to plot continuous data but it isn’t Tableau’s forte.

Tableau expects data to be delivered in “clean” form, where “clean” means that spreadsheets and separated-value files must be presented with a single header line and with columns which each contain data of a single type. Tableau will also connect directly to a variety of databases. Tableau uses the Microsoft JET database engine to store its data; I know this because, for some data, unsightly wrangling is required to load it in the correct format. Once data is loaded, Tableau’s performance is pretty good: I’ve been playing with the MOT data, which is 50,000,000 or so lines, and for the range of operations I tried this turned out to be fairly painless.

Turning to Larry Keller’s book, The Tableau 8.0 Training Manual: From Clutter to Clarity, this is one of few books currently available relating to the 8.0 release of Tableau. As described in the title it is a training manual, based on the courses that Larry delivers. The presentation is straightforward and unrelenting; during the course of the book you build 8 Tableau workbooks, in small, explicitly described steps. I worked through these in about 12 hours of screen time, and at the end of it I feel rather more comfortable using Tableau, if not expert. The coverage of Tableau’s functionality seems to be good, if not deep – that’s to say that as I look around the Tableau interface now I can at least say “I remember being here before”.

Some of the Tableau functionality I find a bit odd. For example, I’m used to seeing box plots generated using R or a similar statistical package; From Clutter to Clarity shows how to make “box plots”, but they look completely different. Similarly, I have a view as to what a heat map looks like, and the Tableau implementation is not what I was expecting.

Personally I would have preferred a bit more explanation as to what I was doing. In common with Andy Kirk’s book on data visualisation I can see this book supplementing the presented course nicely, with the trainer providing some of the “why”. The book comes with some sample workbooks, available on request – apparently directly from the author whose email response time is uncannily quick.

Making a ScraperWiki view with R

In a recent post I showed how to use the ScraperWiki Twitter Search Tool to capture tweets for analysis. I demonstrated this using a search on the #InspiringWomen hashtag, using Tableau to generate a visualisation.

Here I’m going to show a tool made using the R statistical programming language which can be used to view any Twitter Search dataset. R is very widely used in both academia and industry to carry out statistical analysis. It is open source and has a large community of users who are actively developing new libraries with new functionality.

Although this viewer is a trivial example, it can be used as a template for any other R-based viewer. To break the suspense this is what the output of the tool looks like:

R-view

The tool updates when the underlying data is updated; the Twitter Search tool checks for new tweets on an hourly basis. The tool shows the number of tweets found and a histogram of the times at which they were tweeted. To limit the time taken to generate a view, the number of tweets used is capped at 40,000. The histogram uses bins of one minute, so the vertical axis shows tweets per minute.

The code can all be found in this BitBucket repository.

The viewer is based on the knitr package for R, which generates reports in specified formats (HTML, PDF etc.) from a source template file containing R commands that are executed to generate content. In this case we use Rhtml, rather than the alternative Markdown, which enables us to specify custom CSS and JavaScript to integrate with the ScraperWiki platform.

ScraperWiki tools live in their own UNIX accounts called “boxes”; the code for a tool lives in a subdirectory, ~/tool, and anything in the ~/http directory is served as web content. In this project the http directory contains a short JavaScript file, code.js, which, by the magic of jQuery and some messy bash shell commands, puts the URL of the SQL endpoint into a file in the box. It also runs a package installation script once, after the tool is first installed; the only package not already installed is ggplot2.

The ScraperWiki platform has an update hook: simply an executable file called update in the ~/tool/hooks/ directory, which is run whenever the underlying dataset changes.
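
The hook can be any executable. As a rough sketch – the actual hook in the repository may well be a simple shell script – something along these lines would do, re-running knitrview.R so that index.html is regenerated:

#!/usr/bin/env python
# Sketch of an update hook (~/tool/hooks/update): re-knit the report whenever
# the dataset changes. Assumes knitrview.R can be run with Rscript from ~/tool;
# the real hook in the repository may differ.
import os
import subprocess

TOOL_DIR = os.path.expanduser('~/tool')
subprocess.check_call(['Rscript', 'knitrview.R'], cwd=TOOL_DIR)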

This brings us to the meat of the viewer: the knitrview.R file calls the knitr package to take the view.Rhtml file and convert it into an index.html file in the http directory. The view.Rhtml file contains calls to some functions in R which are used to create the dynamic content.

Code for interacting with the ScraperWiki platform is in the scraperwiki_utils.R file, which contains:

  • a function to read the SQL endpoint URL which is dumped into the box by some JavaScript used in the Rhtml template.
  • a function to read the JSON output from the SQL endpoint – this is a little convoluted since R cannot natively use https, and solutions to read https are different on Windows and Linux platforms.
  • a function to convert imported JSON dataframes to a clean dataframe. The data structure returned by the rjson package is made up of lists of lists and requires reprocessing into the preferred vector-based dataframe format.

Functions for generating the view elements are in view-source.R, which means that the R code embedded in the Rhtml template is reduced to simple function calls. The main plot is generated using the ggplot2 library.

So there you go – not the world’s most exciting tool, but it shows the way to make live reports on the ScraperWiki platform using R. Extensions to this would be to allow some user interaction, for example by letting the user adjust the axis limits. This could be done either using JavaScript and vanilla R, or using Shiny.

What would you do with R in ScraperWiki? Let me know in the comments below or by email: ian@scraperwiki.com
