7 ways to get data out of PDFs – Data Journalism Blog

HELP ME INVESTIGATE – By Paul Bradshaw

A frequent obstacle in data journalism is finding that the information you want to analyse is locked away in a PDF. Here are 6 ways to tackle that problem – with space for a 7th:

1) For simple PDFs: Google Docs’ conversion facility

Google Docs recently added a feature that allows you to convert a PDF to a ‘Google document’ when you upload it. It’s pretty powerful, and about the simplest way you can extract information.

It does not work, however, if the PDF was generated by scanning – in other words if it is an image, rather than a document that has been converted to PDF.
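If you are not sure which kind of PDF you have, a rough first check is to look for font and image markers in the raw file. The sketch below is only a heuristic and not part of Google Docs or any other tool mentioned here: the function name `guess_pdf_kind` and its three labels are made up for illustration, and many newer PDFs store their internals in compressed object streams, in which case the function gives up and returns "unknown".

```python
import re


def guess_pdf_kind(data: bytes) -> str:
    """Guess whether raw PDF bytes hold real text or only scanned images.

    Returns 'text', 'scanned?' or 'unknown'. This is a crude byte-level
    heuristic (an assumption for illustration), not a PDF parser.
    """
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF file")

    # Compressed object streams can hide the markers checked below,
    # so no guess is attempted when they are present.
    if b"/ObjStm" in data:
        return "unknown"

    # A font dictionary suggests the file carries a text layer.
    has_font = re.search(rb"/Font\b", data) is not None
    # Image XObjects are what a scanner typically produces.
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None

    if has_font:
        return "text"
    if has_image:
        return "scanned?"
    return "unknown"
```

To try it on a file of your own, read the bytes first, for example with `pathlib.Path("report.pdf").read_bytes()` (the filename is a placeholder). A result of "scanned?" suggests the PDF will need OCR before any text can be extracted.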

2) For scanned documents and pulling out key players: Document Cloud

Document Cloud is a tool for journalists that converts PDFs to text. Along the way it adds ‘semantic’ information, such as which organisations, people and other ‘entities’ (dates and locations, for example) are mentioned in the document. It also has some useful features that let you present documents for others to comment on.

The good news is that it works very well with scanned documents, using Optical Character Recognition (OCR). The bad news is that you need to ask permission to use it, so if you don’t work as a professional journalist you may not be able to use it. Still, there’s no harm in asking. [Read more…]

4 Replies to “7 ways to get data out of PDFs”

Where are the other ways? The link leads to a 404 page….

Marianne Bouchart says:

December 2, 2011 at 12:44 pm

Thanks for letting me know Joerg!! The site changed the url, the rest of the article can now be found here: http://helpmeinvestigate.posterous.com/7-ways-to-…
We also updated the link.
Sorry for the inconvenience.

I'm sorry, the new link is broken too 🙁

Here is a site with the tips… thanks http://helpmeinvestigate.com/blog/7-ways-to-get-d…
