Scraping PDFs is a bit like cleaning drains with your teeth. It’s slow, unpleasant, and you can’t help but feel you’re using the wrong tools for the job.
Coders try to avoid scraping PDFs if there’s any other option. But sometimes, there isn’t – the data you need is locked up inside inaccessible PDF files.
So I’m pleased to present the PDF to HTML Preview, a tool written by ScraperWiki’s Julian Todd to ease the pain of scraping PDFs.
Just enter the URL of your PDF to see a preview in the browser. Click on the text you need – and instantly, you see the underlying XML.
It doesn’t write your scraper for you – but it shows you what you’re scraping, just like “View Source”. And that makes starting out a lot easier.
Scraping PDFs: the problem…
Why is scraping PDFs so hard? Well, the PDF standard was designed to do a particular job: describe how a document looks, anywhere and forever.
It achieves that pretty well. But unlike HTML, the underlying code was never designed to be read. And it contains a lot of bloat.
ScraperWiki already lets you extract XML from a PDF, for simple parsing – you can see the scraperwiki.pdftoxml function in action in our (incredibly basic) tutorial.
But matching up long-winded XML with what you see on the page isn’t always easy. Julian knows this only too well, having scraped PDFs on a grand scale to create UNDemocracy.
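To give a feel for what that XML parsing involves, here is a minimal Python sketch. It assumes the XML looks like the output of pdftohtml's XML mode (a <page> element containing <text> elements with top and left attributes). The sample document and the function names are made up for illustration, so check them against what pdftoxml returns for your own PDF.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, assumed to resemble the XML that pdftoxml returns.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="80" width="200" height="16" font="0"><b>Country</b></text>
<text top="100" left="400" width="60" height="16" font="0">Votes</text>
<text top="130" left="80" width="200" height="16" font="1">Norway</text>
<text top="130" left="400" width="60" height="16" font="1">12</text>
</page>
</pdf2xml>"""


def extract_text_boxes(xml_string):
    """Return one dict per <text> element: its page, position and text."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "top": int(el.get("top")),
                "left": int(el.get("left")),
                # itertext() also picks up text nested inside <b> or <i>
                "text": "".join(el.itertext()).strip(),
            })
    return boxes


def rows_by_position(boxes):
    """Group boxes that share a page and a top coordinate into rows,
    reading each row from left to right."""
    rows = {}
    ordered = sorted(boxes, key=lambda b: (b["page"], b["top"], b["left"]))
    for box in ordered:
        rows.setdefault((box["page"], box["top"]), []).append(box["text"])
    return list(rows.values())


if __name__ == "__main__":
    for row in rows_by_position(extract_text_boxes(SAMPLE_XML)):
        print(row)
```

Grouping on an exact top value is the simplest thing that could work. Real PDFs often put text from the same visual row at slightly different heights, so you would probably need some tolerance there.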
…and the solution
So, the PDF previewer works as follows:
- Grabs the data. Fetches the PDF from the URL you entered and converts it to XML using pdftoxml.
- Outputs as HTML. Renders each PDF page as an absolutely positioned <div>.
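The two steps above can be sketched in a few lines of Python. This is not Julian's actual code, just an illustration of the idea. It assumes the XML resembles pdftohtml's XML output, and the sample XML, the function name, the CSS class names and the data-xml attribute (one possible way to keep each element's source XML to hand for a click handler) are all invented for this example.

```python
import html
import xml.etree.ElementTree as ET

# Hand-written two-page sample, assumed to resemble pdftoxml output.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="80" width="200" height="16" font="0"><b>Country</b></text>
<text top="130" left="80" width="200" height="16" font="1">Norway</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<text top="90" left="80" width="300" height="16" font="1">R&amp;D totals</text>
</page>
</pdf2xml>"""


def pdf_xml_to_html(xml_string):
    """Render each <page> as an absolutely positioned <div>, with one
    absolutely positioned child <div> per <text> element."""
    root = ET.fromstring(xml_string)
    parts = []
    offset = 0  # running vertical offset, so pages stack down the screen
    for page in root.iter("page"):
        width = int(page.get("width"))
        height = int(page.get("height"))
        parts.append(
            '<div class="page" style="position:absolute; top:%dpx; left:0px; '
            'width:%dpx; height:%dpx;">' % (offset, width, height)
        )
        for el in page.iter("text"):
            # Keep the source XML in an attribute (hypothetical name), so a
            # click handler could show it next to the rendered text.
            raw = ET.tostring(el, encoding="unicode").strip()
            visible = "".join(el.itertext())
            parts.append(
                '<div class="text" style="position:absolute; top:%dpx; '
                'left:%dpx;" data-xml="%s">%s</div>'
                % (int(el.get("top")), int(el.get("left")),
                   html.escape(raw, quote=True), html.escape(visible))
            )
        parts.append("</div>")
        offset += height
    return "\n".join(parts)


if __name__ == "__main__":
    print(pdf_xml_to_html(SAMPLE_XML))
```

Because each page <div> is itself absolutely positioned, the top and left values on the text elements are interpreted relative to their own page, which is what lets the XML coordinates be reused unchanged.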
Incidentally, the Preview is also a ScraperWiki view, meaning that you can edit the underlying code if you want it to work differently. In particular, feel free to improve the instructions and the layout!
We’ll be improving our PDF-scraping tutorials and examples in the coming weeks. If you’ve written a clever PDF scraper that would make a good basis for tutorials, please let us know in the comments.