How to extract text from PDF files using poppler and GOCR on Ubuntu

Posted in Howto, Ubuntu by Leif Ryge on July 11, 2010

Poppler provides a suite of utilities for working with PDF files. If you’re using Ubuntu, you’ve already got it installed; otherwise, see if your operating system provides a poppler-utils package.

If your PDF file contains actual text (rather than scanned images of text), you can use the pdftotext command, which writes the text to foo.txt:

pdftotext foo.pdf
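
If you're not sure whether a PDF has a text layer at all, one rough check is to see whether pdftotext produces anything other than whitespace. The sketch below assumes that an empty result means the file is a scan; has_text_layer is a hypothetical helper name, not part of poppler.

```shell
# Hypothetical helper: succeeds if pdftotext finds any non-whitespace text.
# Assumption: a PDF with no extractable text is a scan that needs OCR.
has_text_layer() {
    # Passing "-" as the output file sends the extracted text to stdout.
    [ -n "$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')" ]
}
```

You could then run something like: if has_text_layer foo.pdf; then pdftotext foo.pdf; fi, and fall back to the OCR route otherwise.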

If the PDF instead contains images of text that you want to run OCR on, you’ll need to extract the images first:

pdfimages foo.pdf foo-page

…which will give you a set of files named foo-page-nnn.pbm, where nnn is the image number (read man pdfimages if you’d like to extract only certain pages).
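
For the certain-pages case, pdfimages accepts -f (first page) and -l (last page). A minimal sketch, where extract_page_images is a hypothetical wrapper and the "-page" prefix simply follows the naming used above:

```shell
# Hypothetical wrapper: extract images from pages FIRST..LAST only,
# then list the files that were written.
extract_page_images() {
    pdf=$1; first=$2; last=$3
    prefix="${pdf%.pdf}-page"      # foo.pdf -> foo-page
    pdfimages -f "$first" -l "$last" "$pdf" "$prefix" || return 1
    ls "$prefix"-*
}
```

For example, extract_page_images foo.pdf 3 5 would pull the images from pages 3 through 5 only.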

If, as in my case, the resulting images show white text on a black background, you can correct this with the convert command from ImageMagick (run sudo apt-get install imagemagick if you don’t have it):

for i in *.pbm; do convert -negate "$i" "$i"; done

(I used the shell to iterate through the files because the convert command’s batch functionality doesn’t seem to work with pbm files.)
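
Note that the loop overwrites each image in place, so running it a second time flips the images back. A minimal sketch of a variant that leaves the originals alone and writes negated copies to a separate directory; negate_into is a hypothetical helper name:

```shell
# Hypothetical helper: write negated copies of the given images into OUTDIR,
# leaving the original files untouched.
negate_into() {
    outdir=$1; shift
    mkdir -p "$outdir"
    for img in "$@"; do
        convert "$img" -negate "$outdir/$(basename "$img")"
    done
}
```

Usage would look like negate_into negated foo-page-*.pbm, after which the corrected images are in the negated directory.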

And, finally, use GOCR to recognize the text in the images:

for i in *.pbm; do echo "gocr $i"; gocr "$i" > "$i.txt"; done
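
That leaves one .txt file per image. If you'd rather have a single text file, a sketch like the following appends each image's OCR output in turn. It relies on the zero-padded numbers in the file names so that the shell glob lists the images in order; ocr_to_single_file is a hypothetical helper name.

```shell
# Hypothetical helper: OCR each image and append the results to one file.
ocr_to_single_file() {
    out=$1; shift
    : > "$out"                     # start with an empty output file
    for img in "$@"; do
        gocr "$img" >> "$out"
        printf '\n' >> "$out"      # blank line between images
    done
}
```

For example: ocr_to_single_file foo.txt foo-page-*.pbm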

Note: I wrote this early last year and just today noticed it still sitting in my drafts. I’m not sure why I didn’t post it before, so I will now. Hopefully these instructions are still accurate!

