I Might Be Wrong

How to extract text from PDF files using poppler and GOCR on Ubuntu

Posted in Howto, Ubuntu by Leif Ryge on July 11, 2010

Poppler provides a suite of utilities for working with PDF files. If you’re using Ubuntu, you’ve already got it installed; otherwise, see if your operating system provides a poppler-utils package.

If your PDF file contains text, you can use the pdftotext command:

pdftotext foo.pdf
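If you're not sure whether a PDF actually has a text layer, one rough way to check is to send pdftotext's output to stdout and see whether anything other than whitespace comes back. This is only a sketch: the function name has_text_layer is made up for illustration, and a scanned PDF with a poor embedded OCR layer would still count as "has text".

```shell
# has_text_layer FILE: succeed if pdftotext finds any non-whitespace text.
# (has_text_layer is a hypothetical helper name, not part of poppler.)
has_text_layer() {
    # "-" as the output name tells pdftotext to write to stdout
    # instead of creating FILE.txt
    chars=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
    [ "${chars:-0}" -gt 0 ]
}
```

You could then run something like has_text_layer foo.pdf && pdftotext foo.pdf, and fall back to the OCR steps below when it fails.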

If the PDF contains images of the text which you wish to use OCR on, you’ll need to extract them first:

pdfimages foo.pdf foo-page

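…which will give you a bunch of files called foo-page-nnn.pbm, or foo-page-nnn.ppm if the images aren't monochrome, in which case adjust the extension in the commands below (read man pdfimages if you’d like to only extract certain pages).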
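For the "only certain pages" case, pdfimages takes -f (first page) and -l (last page). A minimal sketch, where the wrapper name extract_page_images and its argument order are my own invention:

```shell
# extract_page_images PDF FIRST LAST [PREFIX]
# Extract the images from pages FIRST..LAST of PDF.
# (extract_page_images is a hypothetical wrapper, not a poppler command.)
extract_page_images() {
    pdf=$1
    first=$2
    last=$3
    # Default prefix: foo.pdf -> foo-page
    prefix=${4:-${pdf%.pdf}-page}
    pdfimages -f "$first" -l "$last" "$pdf" "$prefix"
}
```

For example, extract_page_images foo.pdf 2 4 should behave like pdfimages -f 2 -l 4 foo.pdf foo-page.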

If, as in my case, the resulting images are of white text on a black background, you can correct this using the convert command from ImageMagick (run sudo apt-get install imagemagick to get it):

for i in *.pbm; do convert "$i" -negate "$i"; done

(I used the shell to iterate through the files because the convert command’s batch functionality doesn’t seem to work with pbm files.)
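If only some of the pages come out inverted, negating everything would break the good ones. One possible approach, sketched here under the assumption that "mostly dark" means an average brightness below 0.5, is to ask ImageMagick for the image mean first. The function name negate_if_dark and the 0.5 threshold are assumptions of mine, not anything ImageMagick prescribes.

```shell
# negate_if_dark FILE: invert FILE in place only if it is mostly dark.
# (negate_if_dark and the 0.5 threshold are illustrative assumptions.)
negate_if_dark() {
    # %[fx:mean] prints the average brightness, 0 (black) to 1 (white)
    mean=$(convert "$1" -format '%[fx:mean]' info:) || return 1
    # awk does the floating-point comparison; exit status 0 means "dark"
    if awk -v m="$mean" 'BEGIN { exit !(m < 0.5) }'; then
        convert "$1" -negate "$1"
    fi
}
```

It would slot into the same kind of loop: for i in *.pbm; do negate_if_dark "$i"; done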

And, finally, use GOCR to recognize the text in the images:

for i in *.pbm; do echo "gocr $i"; gocr "$i" > "$i.txt"; done
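Since pdfimages can produce .ppm as well as .pbm files, a slightly more defensive version of that loop might look like the following. Treat it as a sketch: ocr_images is a name I made up, and it assumes the images were extracted with a common prefix such as foo-page.

```shell
# ocr_images PREFIX: run gocr over every PREFIX-*.pbm and PREFIX-*.ppm
# image, writing the recognized text next to each image as IMAGE.txt.
# (ocr_images is a hypothetical helper name.)
ocr_images() {
    for img in "$1"-*.pbm "$1"-*.ppm; do
        # Skip a pattern that matched no files
        [ -e "$img" ] || continue
        echo "gocr $img" >&2
        # Keep going if gocr fails on one image
        gocr "$img" > "$img.txt" || echo "gocr failed on $img" >&2
    done
}
```

After pdfimages foo.pdf foo-page, running ocr_images foo-page should leave one .txt file beside each extracted image.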

Note: I wrote this early last year and just today noticed it still sitting in my drafts. I’m not sure why I didn’t post it before, so I will now. Hopefully these instructions are still accurate!


8 Responses


  1. Sumail said, on October 31, 2010 at 2:56 pm

    Useful! I had some problems because gocr seemed to hang on one pbm image; I was only able to convert all the pages after removing the ones without text.

  2. Chris Lanzit said, on February 8, 2012 at 5:39 pm

    Very useful, but one correction. The converted files are .ppm vice .pbm. Once I corrected that, your instructions worked perfectly.

  3. Alexis said, on September 21, 2013 at 11:17 pm

    Try this : If your PDF file contains text, you can use the pdftotext command :

    pdftotext foo.pdf foo.txt

  4. electrical equipment online shop said, on March 1, 2014 at 11:41 pm

    Spot on with this write-up, I honestly believe that this website needs a lot
    more attention. I’ll probably be returning to see more, thanks for the information!

  5. pinfoundry said, on October 2, 2014 at 5:41 pm

    Nice blog here! Also your website loads up fast!
    What host are you using? Can I get your affiliate link to your host?
    I wish my web site loaded up as quickly as yours lol

  6. weight safely said, on October 2, 2014 at 6:28 pm

    It’s amazing to go to see this website and reading the views of all
    colleagues concerning this post, while I am also zealous of getting familiarity.

  7. seo india said, on October 3, 2014 at 6:11 pm

    I am not certain where you’re getting your information, but great topic.
    I must spend a while studying much more or figuring out more.

    Thanks for the magnificent information; I was on the lookout for this information for my mission.

  8. Jan said, on April 27, 2015 at 2:53 am

    Thanks! Really helpful.
