Use OCR on a PDF and show just the OCR output text #1291

Answered by jbarlow83
b01000100 asked this question in Q&A

ocrmypdf can't quite do this on its own, since it writes the recognized text as an invisible layer on top of the page image.

There are some commercial OCR engines that can attempt to reconstruct a document when the font is recognized and give you an editable document as output. That's beyond what the available open source tech lets us do: we don't have an open source OCR engine that distinguishes fonts or does precision text layout. But since you're not as concerned about the exact layout, you have more options.

You can use pdftotext (maybe with -layout) to extract the text from the finished PDF.
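
As a rough sketch of the pdftotext route, something like the following could work. It assumes pdftotext (from poppler-utils or xpdf) is on your PATH; the file name `ocr_output.pdf` and the helper names are made up for illustration.

```python
import shutil
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, keep_layout: bool = True) -> List[str]:
    """Build a pdftotext command line that writes the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout
        cmd.append("-layout")
    # "-" as the output file means "write to stdout"
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str, keep_layout: bool = True) -> str:
    """Run pdftotext on an already-OCRed PDF and return its text layer."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name: a PDF that ocrmypdf has already processed
    print(pdftotext_command("ocr_output.pdf"))
```

Dropping `-layout` (here, `keep_layout=False`) gives plain reading-order text, which may be easier to post-process if you don't care about columns or spacing.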

You can also use ocrmypdf --sidecar to generate text files containing the OCR output. Note that in a document with mixed vector/raster…
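
A minimal sketch of the `--sidecar` route, assuming ocrmypdf is installed and on your PATH. The file names and helper names below are hypothetical, and this is just one way to wrap the command line, not the only way to call it.

```python
import subprocess
from pathlib import Path
from typing import List


def sidecar_command(input_pdf: str, output_pdf: str, sidecar_txt: str) -> List[str]:
    """Build an ocrmypdf command line that also writes a sidecar text file."""
    # --sidecar takes the path of the text file to write the OCR output to
    return ["ocrmypdf", "--sidecar", sidecar_txt, input_pdf, output_pdf]


def ocr_to_text(input_pdf: str, output_pdf: str, sidecar_txt: str) -> str:
    """Run ocrmypdf and return the contents of the sidecar text file."""
    subprocess.run(sidecar_command(input_pdf, output_pdf, sidecar_txt), check=True)
    return Path(sidecar_txt).read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    print(sidecar_command("scan.pdf", "scan_ocr.pdf", "scan.txt"))
```

If you only want the text, you can ignore or delete the output PDF afterwards and keep just the sidecar file.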

Answer selected by b01000100