#InspiringWomen – catching twitter with ScraperWiki

Commodore Grace Hopper

Those of you on Twitter may have caught the recent #InspiringWomen hashtag, which arose in response to the online abuse and threats received by many women in the public eye. On Sunday 4th August people tweeted about women who inspired them, marking their tweets with the #InspiringWomen hashtag.

#InspiringWomen was launched by my friend, @daintyballerina. In the aftermath she asked how all of these tweets could be captured. The problem is that the number of tweets involved (about 40,000) does not sit well with many online tools, since such a large set takes a long time to download.

This is a job for the ScraperWiki Search for tweets tool! Getting it going is simply a matter of typing in a search term and clicking a couple of buttons. The search tool uses the free Twitter API, so it takes a while to fetch such a big set of tweets, but this is fine since the ScraperWiki platform will chug away collecting tweets even if you’ve switched off your computer. Overnight, I had collected 40,000 tweets.
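For the curious, fetching a hashtag through the free search API looks roughly like the sketch below. This is not the ScraperWiki tool’s own code – the choice of the tweepy library, the credential placeholders and the fields kept are all illustrative assumptions.

```python
# Rough sketch of collecting #InspiringWomen tweets via the free Twitter
# search API. Not the ScraperWiki tool's code: tweepy, the credential
# placeholders and the fields kept are illustrative assumptions.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
# wait_on_rate_limit makes tweepy sleep when the free API throttles us,
# which is why collecting a big set like this takes a while.
api = tweepy.API(auth, wait_on_rate_limit=True)

tweets = []
# Note: in tweepy 4.x the search method is called api.search_tweets instead.
for status in tweepy.Cursor(api.search, q="#InspiringWomen", count=100).items():
    tweets.append({
        "id": status.id,
        "user": status.user.screen_name,
        "text": status.text,
        "retweet_count": status.retweet_count,
        "created_at": status.created_at,
    })

print("Collected %d tweets" % len(tweets))
```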

I offered to supply all of the tweets, with 16 pieces of information for each tweet, to @daintyballerina as an Excel spreadsheet. This is easy to do with the Download as spreadsheet tool. However, we agreed that it would be best if I supplied the unique tweets along with the name of the Twitter user and the number of tweets, in descending order of popularity. Looking at the 20,000 or so unique tweets that remained, I then decided to limit the list to tweets with at least one retweet. This set of data can be generated from the tweets using the Query with SQL tool; SQL is a database querying language of long pedigree. I then exported the tweets to Word and sent them on to @daintyballerina.
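Since the tweets live in a SQLite database on the platform, the same cut of the data can be reproduced locally against a downloaded copy. A rough sketch follows – the table and column names are assumptions about the schema, not the search tool’s exact names, and grouping by raw text is a simplification (retweets fetched via the search API usually repeat the original text with an “RT @…” prefix, so a faithful version would normalise that first).

```python
# Sketch of the "popular nominations" query, run locally against a downloaded
# copy of the SQLite database. Table/column names are assumptions about the
# schema; grouping by raw text is a simplification (retweets usually carry an
# "RT @..." prefix that a faithful version would strip first).
import sqlite3

conn = sqlite3.connect("inspiringwomen.sqlite")
query = """
    SELECT text,
           screen_name,
           COUNT(*) AS mentions          -- how many times this tweet appears
    FROM tweets
    GROUP BY text                        -- collapse duplicates to one row
    HAVING COUNT(*) > 1                  -- keep tweets with at least one retweet
    ORDER BY mentions DESC               -- most popular nominations first
"""
for text, screen_name, mentions in conn.execute(query):
    print(mentions, screen_name, text)
```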

You can see the archive of tweets at the Inspiring Women Project website. For the impatient, the five most retweeted nominations were:

  1. Emma Watson
  2. Ada Lovelace
  3. Delia Derbyshire
  4. JK Rowling
  5. Hedy Lamarr

But the fun doesn’t stop there!

The Summarise Automatically tool gives you a quick view of any dataset, making a guess at how best to show the data. For example, a column with lots of text is shown as a wordle:

Inspiring Women Wordle
The wordle skips common words like “if” and “it”.
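Behind the wordle is essentially a word-frequency count with a stop-word list. This isn’t the tool’s actual code, but the idea looks something like the following sketch (the stop-word list is a tiny illustrative sample):

```python
# Sketch of the idea behind the wordle: count words in the tweet text while
# skipping common stop words. The stop-word list is a tiny illustrative sample.
import re
from collections import Counter

STOP_WORDS = {"if", "it", "the", "a", "an", "and", "of", "to", "my", "is", "for"}

def word_frequencies(texts):
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts

sample = ["My mum, who taught herself to program in the 1960s #InspiringWomen"]
print(word_frequencies(sample).most_common(10))
```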

The Twitter API returns both media and URL data, both of which contain at least some images. Columns with links to images are shown as a montage of the most frequent images:

Inspiring Women

On the ScraperWiki platform these images are linked, and you can use Google Image Search to find out who they show. From top left to bottom right, the images are: Emma Watson; Lucille Bluth, a fictional character from Arrested Development; Girls’ Generation, a South Korean girl band; a hairy guy, something of a gay pinup; Marilyn Monroe; Donald Trump (I don’t know!); Rosa Parks; “Hehalutz women captured with weapons”, from the report of Jürgen Stroop on the Warsaw Ghetto Uprising; Nellie Spindler, the only woman to die at the Ypres Salient in the First World War; Mary Shelley; Nancy Pelosi; Hayley Williams, lead vocalist of Paramore; a Caribbean woman from the Auxiliary Territorial Service in WWII; a Women’s Land Army recruitment poster; Violette Szabo, an Allied spy executed by the SS; and finally a model wearing a cardigan. It’s possible the model wearing a cardigan is someone famous.

Summarise Automatically also produces a histogram of when tweets with the #InspiringWomen hashtag were sent. This is OK, but the resolution of one day is a bit low, so I downloaded the data as a spreadsheet and viewed it using Tableau Public. This works nicely, except that Tableau is very picky about date-time formats and I had to create a new column in Excel to get it to import (a scripted version of the same fix is sketched after the chart discussion below). Thanks to Tableau Public I can not only show you the result as an image:

InspiringWomenTimeline
…but you can also play with an interactive version of the plot on Tableau Public, here. This allows you to see the underlying data, download it as an image, or download it to Tableau Public on your computer to change as you wish.

The chart shows the number of tweets per minute from late on Saturday 3rd August through to the morning of 6th August; higher peaks mean more tweets.

As you can see, discussion of #InspiringWomen started on the evening of Saturday 3rd and rose through Sunday morning to a peak at around 11am, when everyone woke up and started tweeting. The big peak just after midnight on Monday 5th August is the nomination of Emma Watson (the actress who played Hermione in the Harry Potter films), which was then heavily retweeted. The tall, very narrow spike at 6am the same morning is spammers randomly retweeting one of the #InspiringWomen tweets – you can tell this by looking at the user names, which all look rather random. Presumably they appear in a single big spike because otherwise Twitter would filter them automatically.
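As promised above, here is what the Excel date fix looks like when scripted instead. This is a minimal sketch: the file names and the created_at column are assumptions, though the timestamp format shown is the one the Twitter API returns.

```python
# Minimal sketch of converting Twitter's timestamps into a format Tableau
# imports happily. File names and the "created_at" column are assumptions.
import pandas as pd

df = pd.read_csv("inspiringwomen.csv")

# Twitter's created_at looks like "Sun Aug 04 11:02:13 +0000 2013"; convert it
# and write out a plain "YYYY-MM-DD HH:MM:SS" column for Tableau instead.
df["created_at_clean"] = pd.to_datetime(
    df["created_at"], format="%a %b %d %H:%M:%S %z %Y"
).dt.strftime("%Y-%m-%d %H:%M:%S")

df.to_csv("inspiringwomen_for_tableau.csv", index=False)
```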

What would you want to know about a hashtag? Let me know.

Footnote

The picture at the top of this post is Grace Hopper, who popularised the term “bug” with reference to programming, although in her case a bug was literally a bug – an insect found in the relays of an early computer. On the day, I nominated my mum as an #InspiringWoman; she started programming in the early 1960s, but she wouldn’t want me to share her picture with the world!


Hello, I’m Ed

Last week, Pete wrote his welcome post announcing he was the new guy. Not strictly true, as I am the real new guy! Actually, Pete and I started on the same day, and there are two other new starters you will be hearing about soon, so we are both old news!

Unlike some of the big developer/data science brains we have at ScraperWiki, I have a different role: making sure what we build adds enough value that our customers are willing to pay for some of it!

I started my working life in the RAF as a trainee pilot, but a hockey stick to the eye cut that short. It did, however, allow me to start down a career with a more technical bent, by completing a Masters in Information Systems and Technology at City University.

From there I was a Sales Engineer for various software vendors and moved into product management. I have worked for companies of different sizes, from the very large (IBM) to the very small (my own startup building tools for product managers), before arriving at ScraperWiki.

Products are created to solve problems for people and companies, and therefore make their lives easier, so my first task is to work out what problems our customers, and others interested in data, experience with ScraperWiki and its tools today. ScraperWiki can then evolve to become the tool of choice for tomorrow 🙂

Let me know if you have any feedback or ideas – I am really interested to hear your thoughts.


Adventures in Tableau – loading files

TableauLogo

Tableau is a widely used visualisation tool, particularly in the business intelligence area. It grew out of the Polaris project at Stanford University, subtitled “interactive database visualisation”. This is worth bearing in mind, since it is the context in which Tableau deals with data: it expects the data you are interested in to come in the form of database tables, which is a downside in terms of flexibility over what it can import and an upside in terms of connecting tables of data with conventional database joins.

The ScraperWiki platform makes a nice fit with Tableau since we specialise in sourcing and offering up cleaned data whilst Tableau specialises in visualising clean data; we’ve found it to be a powerful tool in this respect.

I’m planning a series of posts on Tableau. My aim is not to write a list of recipes (“click this to get that”) but to describe how Tableau works, making it easier to address new challenges. I’m using Tableau Version 8.0 Professional. I’ll start today with a post on how to load data into Tableau; future posts will cover measures and dimensions, calculated fields, creating visualisations, filtering and so forth (to cover the fact that I haven’t planned everything out in advance!).

This first post is about the process of loading data into Tableau. Tableau makes a clean separation between data and visualisation. The first step in making a visualisation is connecting to the data, accessed via the Data->Connect to Data menu item, or the Connect to Data link in the panel at the top left of the screen. This data connection can be to one of the following local sources:

  • Microsoft Access
  • Microsoft Excel
  • Text file
  • Tableau data extract
  • Import from workbook

Tableau data extracts (*.tde) are Tableau’s native binary data format; typically they are built by importing data from other sources, but they can also be generated programmatically using the Tableau Data Engine API. A workbook is Tableau’s name for a visualisation file; “Import from workbook” just gets the data used in another workbook.

Alternatively, in the Professional version, there is a list of 30 or so server based sources including SQL databases of various types, Hadoop based databases, data warehouses, and a couple of services such as Google Analytics and Salesforce.

Tableau data sources

It is also possible to do a simple copy-paste from a source like Excel using the Data->Paste Data menu option. For the Excel and Text file options your files need to be formatted in a particular way – essentially to look like database tables. That is to say, with at most a single column header line, and with each column containing a single type of data.
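As a concrete, made-up example, the kind of text file Tableau ingests cleanly has exactly one header row and consistently typed columns, like the one this small sketch writes out (the file name and the data are invented for illustration):

```python
# Sketch of a Tableau-friendly text file: one header row only, and each column
# holding a single type of data. File name and contents are invented.
import csv

rows = [
    ("vehicle_id", "fuel_code", "co2_g_per_km"),   # the single header line
    ("V001", "P", 145.2),
    ("V002", "D", 120.7),
    ("V003", "E", 0.0),
]

with open("vehicles.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```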

Once you have selected a data source you are given options, which vary a little bit between sources.

Tableau File Import dialog

The key variation between these methods is in the connection/import options: the database connections need authentication and connection information. The interesting part is the variously named table selection step: for all data sources it is possible to select multiple tables at import and join them together. Doing this enables you to add new columns to a parent table using a common column as a key. For example, you may have a column in your main table which is a short code for something (e.g. vehicle fuel type); in another, small table you might have a lookup which gives the long name for each short code. Joining the tables at load adds this long-name column whilst keeping the short code. There is an alternative method, data blending, which keeps the data sources separate whilst making a connection between them.
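Tableau performs the join for you in the import dialog, but the result is just an ordinary key-based join. The sketch below expresses the fuel-type example in pandas purely to show what comes out; the tables, column names and data are invented for illustration.

```python
# Sketch of what joining a lookup table at import achieves, using the
# fuel-type example. Tables, column names and data are invented.
import pandas as pd

vehicles = pd.DataFrame({
    "vehicle_id": ["V001", "V002", "V003"],
    "fuel_code":  ["P", "D", "E"],        # short code in the main table
})

fuel_lookup = pd.DataFrame({
    "fuel_code": ["P", "D", "E"],
    "fuel_name": ["Petrol", "Diesel", "Electric"],   # long name in the lookup
})

# A left join on the common column adds the long name while keeping the code.
joined = vehicles.merge(fuel_lookup, on="fuel_code", how="left")
print(joined)
```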

Finally, you are asked whether you want to Connect Live to the data, Import All data or Import Some Data. Connect Live means that Tableau accesses the original file rather than producing a Tableau Data Extract.

Tableau connection type dialog

Once you have configured a data connection you can save it for future reuse using the menu item Data->[Your data connection name]->Add to saved data sources. Saved Data Sources (*.tds) are small XML files. There are a whole bunch of options on the Data->[Your data connection name] menu.

Tableau Data Menu

Generally they enable you to make changes to the options chosen when setting up the data connection.

Tableau uses the Microsoft database engine to connect to Excel and text-based sources, and it will automatically decide what type of data appears in each column (i.e. string, number, date). Sometimes, for CSV files, it gets this wrong, in which case you need to use this trick.

You can check your data has loaded correctly using View Data, variously found as an icon at the top of the Dimensions pane or on the menu item Data->[Your data connection name]->View Data… Once the data has been successfully loaded, a list of variable names appears in the Dimensions and Measures panels at the side of the screen; more on these in the next post.

Are you interested in a Tableau connector tool for ScraperWiki? If so let me know at ian@scraperwiki.com


Exploring Stack Exchange Open Data

Inspired by my long commute and the pretty dreadful EDM music blasting out in my gym, I’ve found myself on a bit of a podcast kick lately. Besides my usual NPR fare (if you’ve not yet listened to an episode of This American Life with Ira Glass, you’ve missed out), I’ve been checking out the Stack Exchange podcast, a fairly irreverent take on the popular Q&A website, hosted by its founders. In the 51st episode, they announced the opening of their latest site, which focuses on the exciting world of open data.

Perhaps the most common complaint I’ve heard since I started surrounding myself with data scientists is that getting specific sets of data can be frustratingly hard. Often, you will find that what you can get by scraping a website is more than sufficient. That said, if you’re looking for something oddly specific – like the nutritional information of every food product on the shelves of UK supermarkets – you can quickly find yourself hitting some serious brick walls.

That’s where Stack Exchange Open Data comes in. It follows the typical formula that Stack Overflow has adhered to since its inception. Good questions rise to the top whilst bad ones fade into irrelevance.

stackexchange

The aim of this site is to provide a handy venue for finding useful datasets to analyze or use in projects. Despite only opening quite recently, it has garnered a large userbase, and people are asking interesting questions and getting helpful answers. These range from requests for information about German public transportation to global terrain data.

Will you be using Stack Exchange Open Data in one of your future projects? Has Stack Exchange Open Data helped you find a particularly elusive dataset? Let me know in the comments below.


Hi, I’m Peter

… and I’m the new guy. I’ve just completed my PhD in particle physics on the ATLAS experiment at CERN. I loved the physics (because “searching for extra dimensions of space” sounds so cool!) but after 8 years I decided I wanted to do something different. At heart, I’m a programmer and a hacker who is fascinated by computers and the immense power they put in your hands. We live in an age where a single person can sift through billions of records in an instant. Even today I repeatedly find myself saying “we live in the future, man”. Yet we take Google (or DuckDuckGo) for granted.

On my travels I have spent a lot of time with the lower levels of the machine, writing an optimized data format for ATLAS’ huge amount of data. I also collaborated with friends on a tool for visualizing the nature of our proton collisions. My default state is to be immersed in code and data.

I was searching for a new job to start my future career outside of academia and there was little to be found. Outside of London or Silicon Valley, there seemed to be very few companies in the world — let alone in my locality — which understood who I was and what made me tick. It is very fortuitous that I’ve found myself working with this band of awesome people on stuff we care about at ScraperWiki.

In the short term, the focus of my efforts will be building tools for ScraperWiki’s new platform and enhancing the platform itself to make it work faster so that we can provide deeper value to our customers. In the medium term I’m hoping to introduce Docker to our toolset and eventually expose it to our users, so that you can trivially run your tools and code anywhere!

Think I might be able to help you? Shoot me a mail.


My First Month As an Intern At ScraperWiki

The role of an intern is often a lowly one. Intern duties usually consist of the provision of caffeinated beverages, screeching ‘can I take a message?’ into phones and the occasional promenade to the photocopier and back again.

ScraperWiki is nothing like that. Since starting in late May, I’ve taken on a number of roles within the organization and learned how a modern-day, Silicon Valley style startup works.

How ScraperWiki Works

It’s not uncommon for computer science students to be taught some project management methodologies at university. For the most part though, they’re horribly antiquated.

ScraperWiki is an XP/Scrum/Agile shop. Without a doubt, this is something that is definitely not taught at university!

Each day starts off with a “stand up”. Each member of the ScraperWiki team says what they intend to accomplish in the day. It’s also a great opportunity to see if one of your colleagues is working on something on which you’d like to collaborate.

Collaboration is key at ScraperWiki. From the start of my internship, I was pair programming with the many other programmers on staff. For those of you who haven’t heard of it before, pair programming is where two people use one computer to work on a project.

It’s awesome, because it’s a totally non-passive way of learning. If you’re driving, you’re getting first-hand experience of writing code. If you’re navigating, you get the chance to mentally structure the code that you’re working on.

In addition to this, every two weeks we have a retrospective where we look at how the previous fortnight went and what next steps we intend to take as an organization. We write a bunch of sticky notes listing what was good and what was bad about the previous fortnight. These are then put into logical groups, and we vote for the group of stickies which best represents where we feel we should focus our efforts as an organization.

What We Work On

Perhaps the most compelling argument for someone to do an internship at ScraperWiki is that you can never really predict what you’re going to do from one day to the next. You might be working on an interesting data science project with Dragon or Paul, doing front end development with Zarino or making the platform even more robust with Chris. As a fledgling programmer, you really get an opportunity to discover what you enjoy.

During my time working at ScraperWiki, I’ve had the opportunity to learn about some new, up-and-coming web technologies, including CoffeeScript, Express and Backbone.js. These are all pretty fun to work with.

It’s not all work and no play, either. Most days we go out to a local restaurant to get food and eat lunch together. Usually it’s some variety of Middle Eastern, American or Chinese – and it’s usually pretty delicious!

Scraper-Scoff

All in all, ScraperWiki is a pretty awesome place to intern. I’ve learned so much in just a few weeks, and I’ll be sad to leave everyone when I go back to my second year of university in October.

Have you interned anywhere before? What was it like? Let me know in the comments below!


9 things you need to know about the “Code in your browser” tool

Code in your browser

ScraperWiki has always made it as easy as possible to code scripts to get data from web pages. Our new platform is no exception. The new browser-based coding environment is a tool like any other.

Here are 9 things you should know about it.

Choose language

1. You can use any language you like. We recommend Python, as it is easy to read and has particularly good libraries for doing data science.

2. You write your code using the powerful ACE editor. This has similar keystrokes and features to window-based programmers’ editors.

ACE logo

3. It’s easy to transfer a scraper from ScraperWiki Classic. Find your scraper, choose “View source”, then copy and paste the code into the new “Code in your browser” tool. You have to make sure you keep the new first line that says the language, e.g. “#!/usr/bin/env python” (there’s a minimal example of such a scraper at the end of this post).

4. There are tutorials on GitHub, if you want to learn to scrape. It’s a wiki – please help improve them! The tutorials work just as well on your own laptop too.

Stop / Running

5. To run the code, press the “Run” button; to stop it, press the “Stop” button.

Schedule menu

6. The code carries on running in the background even if you leave the page. You can come back and see the output log, or even see a scheduled run happening mid-flow.

7. It has flexible scheduling. As well as hourly, daily and monthly, you can choose the time of day you want it to run.

8. You can SSH in, if you need to do something the tool doesn’t do. Your scraper is in “code/scraper”. You can install new libraries, add extra files, edit the crontab, access the SQLite database from the command line, use the Python debugger… Whatever you need.

Report bug

9. It’s open source. You can report bugs and make improvements to the tool’s interface. Please send us pull requests!

Want to know more? Try this quick start guide. Read the tool’s FAQ. Or find out 10 technical things you didn’t know about the new ScraperWiki.
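Finally, as promised in point 3, here is a flavour of what a scraper written in the tool might look like. It is a minimal sketch: the shebang line is the one mentioned above, but the target URL, the parsing and the local SQLite file are illustrative assumptions rather than a prescribed pattern.

```python
#!/usr/bin/env python
# Minimal sketch of a scraper: fetch a page, pull out some values and store
# them in SQLite. The URL, the table selector and the database file name are
# illustrative assumptions, not a prescribed pattern.
import sqlite3

import requests
from bs4 import BeautifulSoup

html = requests.get("http://example.com/prices").text
soup = BeautifulSoup(html, "html.parser")

conn = sqlite3.connect("scraperwiki.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS prices (name TEXT, price TEXT)")

for row in soup.select("table.prices tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if len(cells) == 2:
        conn.execute("INSERT INTO prices VALUES (?, ?)", cells)

conn.commit()
```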
