Lots of new libraries

We’ve had lots of requests recently for new 3rd party libraries to be accessible from within ScraperWiki. For those of you who don’t know, yes, we take requests for installing libraries! Just send us word on the feedback form and we’ll be happy to install.

Also, let us know why you want them as it’s great to know what you guys are up to. Ross Jones has been busily adding them (he is powered by beer if ever you see him and want to return the favour).

Find them listed in the “3rd party libraries” section of the documentation.

In Python, we’ve added:

  • csvkit, a bunch of tools for handling CSV files made by Christopher Groskopf at the Chicago Tribune. Christopher is now lead developer on PANDA, a Knight News Challenge winner who are making a newsroom data hub
  • requests, a lovely new layer over Python’s HTTP libraries made by Kenneth Reitz. Makes it easier to get and post.
  • Scrapemark is a way of extracting data from HTML using reverse templates. You give it the HTML and the template, and it pulls out the values.
  • pipe2py was requested by Tony Hirst, and can be used to migrate from Yahoo Pipes to ScraperWiki.
  • PyTidyLib, to access the old classic C library that cleans up HTML files.
  • SciPy is at the analysis end, and builds on NumPy giving code for statistics, Fourier transforms, image processing and lots more.
  • matplotlib, can almost magically make PNG charts. See this example that The Julian knocked up, with the boilerplate to run it from a ScraperWiki view.
  • Google Data (gdata) for calling various Google APIs to get data.
  • Twill is its own browser automation scripting language and a layer on top of Mechanize.

In Ruby, we’ve added:

  • tmail for parsing emails.
  • typhoeus, a wrapper round curl with an easier syntax, and that lets you do parallel HTTP requests.
  • Google Data (gdata) for calling various Google APIs.

In PHP, we’ve added:

  • GeoIP for turning IP addresses into countries and cities.
Let us know if there are any libraries you need for scraping or data analysis!
This entry was posted in developer and tagged , , , , , . Bookmark the permalink.

2 Responses to Lots of new libraries

  1. mazadillon says:

    I don’t think it’s documented anywhere but I found that I could use pdf2txt.py which is part of PDF Miner by using the ‘bash trick’ someone discussed on the mailing list – basically running a local command in the sandbox.

    Perhaps worth mentioning somewhere?

  2. Francis Irving says:

    There’s a small mention in this FAQ: https://scraperwiki.com/docs/python/faq/#files

    But no, it isn’t very clear! Will update the FAQ a bit to have a separate question saying you can spawn external commands with link to some examples.

    Just to get a handle on your use case, are you calling PDF Miner from Ruby or PHP? Assuming Python people would call the API?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s