Scraping PDFs: now 26% less unpleasant with ScraperWiki

Scraping PDFs is a bit like cleaning drains with your teeth. It’s slow, unpleasant, and you can’t help but feel you’re using the wrong tools for the job.

Coders try to avoid scraping PDFs if there’s any other option. But sometimes, there isn’t – the data you need is locked up inside inaccessible PDF files.

So I’m pleased to present the PDF to HTML Preview, a tool written by ScraperWiki’s Julian Todd to ease the pain of scraping PDFs.

Just enter the URL of your PDF to see a preview in the browser. Click on the text you need – and instantly, you see the underlying XML.

The PDF to HTML Preview.

It doesn’t write your scraper for you – but it shows you what you’re scraping, just like “View Source”. And that makes starting out a lot easier.

Scraping PDFs: the problem…

Why is scraping PDFs so hard? Well, the PDF standard was designed to do a particular job: describe how a document looks, anywhere and forever.

It achieves that pretty well. But unlike HTML, the underlying code was never designed to be read. And it contains a lot of bloat.

Adobe HQ in California. Locals say that only one person works inside – a reference to PDFs’ bloated file sizes.

ScraperWiki already lets you extract XML from a PDF, for simple parsing – you can see the scraperwiki.pdftoxml library in our (incredibly basic) tutorial.
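
To give a feel for what that XML looks like once you have it, here is a minimal Python sketch. It assumes the output follows the pdftohtml-style layout (page elements containing text elements with top and left attributes). The sample string and the extract_fragments helper are made up for illustration; on ScraperWiki the string would come from scraperwiki.pdftoxml instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like pdftohtml-style XML output.
# A real document would be far longer and messier.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="50" width="120" height="14" font="0">Total grants</text>
<text top="100" left="400" width="60" height="14" font="0"><b>1,250</b></text>
</page>
</pdf2xml>"""


def extract_fragments(xml_string):
    """Return one dict per <text> element: page number, position and text."""
    root = ET.fromstring(xml_string)
    fragments = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            fragments.append({
                "page": int(page.get("number")),
                "top": int(text.get("top")),
                "left": int(text.get("left")),
                # itertext() also picks up text nested inside <b> or <i>
                "text": "".join(text.itertext()).strip(),
            })
    return fragments


fragments = extract_fragments(SAMPLE_XML)
for fragment in fragments:
    print(fragment)
```

Fragments that share a top value usually sit on the same printed line, which is often the first clue you need when rebuilding a table from a PDF.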

But matching up long-winded XML with what you see on the page isn’t always easy. Julian knows this only too well, having scraped PDFs on a grand scale to create UNDemocracy.

…and the solution

So, the PDF previewer works as follows:

  • Grabs the data. Gets the XML using pdftoxml.
  • Outputs as HTML. Outputs each PDF page as an absolutely positioned <div>.
  • Adds JavaScript onclick events. Attaches simple events so that when you click on a word or phrase, you see the underlying XML.
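
Those three steps can be sketched in a few lines of Python. This is not Julian's actual code, just a rough illustration under the same assumptions: the XML follows the pdftohtml-style layout, and the sample string, the render_preview name and the alert() popup are all invented for the example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical two-page sample in pdftohtml-style XML.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="120" height="14" font="0">Page one text</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<text top="120" left="50" width="120" height="14" font="0">Page two text</text>
</page>
</pdf2xml>"""


def render_preview(xml_string):
    """Turn the XML into HTML: one positioned <div> per page and per text fragment."""
    root = ET.fromstring(xml_string)
    parts = []
    page_offset = 0  # stack the pages vertically, one below the other
    for page in root.iter("page"):
        width = int(page.get("width"))
        height = int(page.get("height"))
        parts.append(
            '<div class="page" style="position:absolute; top:%dpx; left:0; '
            'width:%dpx; height:%dpx">' % (page_offset, width, height)
        )
        for text in page.iter("text"):
            # Keep the raw XML of this fragment so a click can reveal it.
            raw_xml = ET.tostring(text, encoding="unicode").strip()
            label = "".join(text.itertext())
            parts.append(
                '<div style="position:absolute; top:%dpx; left:%dpx" '
                'data-xml="%s" onclick="alert(this.dataset.xml)">%s</div>'
                % (
                    int(text.get("top")),
                    int(text.get("left")),
                    html.escape(raw_xml, quote=True),
                    html.escape(label),
                )
            )
        parts.append("</div>")
        page_offset += height
    return "\n".join(parts)


preview_html = render_preview(SAMPLE_XML)
print(preview_html)
```

Storing the escaped XML in a data attribute keeps the click handler trivial: the browser decodes the attribute, so the popup shows the original markup for exactly the fragment you clicked.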

Incidentally, the Preview is also a ScraperWiki view, meaning that you can edit the underlying code if you want it to work differently. In particular, feel free to improve the instructions and the layout!

We’ll be improving our PDF-scraping tutorials and examples in the coming weeks. If you’ve written a clever PDF scraper that would make a good basis for tutorials, please let us know in the comments.


2 Responses to Scraping PDFs: now 26% less unpleasant with ScraperWiki

  1. Pingback: links for 2010-12-17 « Sarah Booker

  2. Stoph says:

    This is great, but the pdftohtml codebase doesn’t appear to handle newer PDFs. It just shows page numbers for the following PDF. In contrast, xpdf can convert it to plain text or to simple HTML.

    http://dynamodata.fdncenter.org/990_pdf_archive/351/351019477/351019477_200806_990.pdf

    pdftohtml is based on xpdf 2.02 while xpdf has now moved on to 3.02.
    Is it possible to incorporate xpdf and use its layout mode and “grep-like” processing for scraping?

    http://www.foolabs.com/xpdf/

    Thanks,

    Stoph

    P.S. By the way, fantastic work on scraperwiki
