[Bug]: OCR on .pdf isn't the same as tesseract but the format is correct on .txt file #1244

matsumurae · 2024-02-03T11:19:18Z

Describe the bug

Reason for this issue

I've been trying to turn a lot of Japanese novels I have at home into searchable PDFs, which would make it easier to look up unknown kanji. However, generating a searchable PDF didn't go as expected.

The problem

I've tried different formats, but I'm unsure why the text is generated correctly as .txt but not as .pdf. I've attached both results: the Tesseract .txt and the OCRmyPDF .txt, plus the .pdf generated by OCRmyPDF.

As you can see, both .txt files are almost identical (I spotted one different kanji, but everything else looked the same). This isn't the case with the .pdf. When I copy text from it, spaces are added:
そのまま男の両足がふわりと浮き上がり、彼の中で、世界がぐるりと回転した。

This doesn't happen with Apple's OCR over images, which results in the same as the .txt file.
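As a stopgap while the PDF text layer behaves this way, the copied text can be cleaned up after the fact. This is only a minimal sketch: it assumes the extra characters are ordinary spaces or tabs sitting between Japanese characters, and the function name `strip_cjk_spaces` is hypothetical (it is not part of OCRmyPDF or Tesseract).

```python
import re

# Hiragana, katakana, CJK ideographs, CJK punctuation and fullwidth forms.
# The ideographic space (U+3000) is deliberately left out of the class.
_CJK = r"\u3001-\u303f\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uff01-\uffef"

# A run of spaces/tabs that has a CJK character on both sides.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove spaces that appear between two Japanese characters.

    Spaces next to Latin letters or digits are kept, so mixed
    Japanese/English text is not glued together.
    """
    return _SPACE_BETWEEN_CJK.sub("", text)
```

This does not fix the PDF itself; it only normalizes text after it has been copied out.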

Notes about my computer

I'm using a 2020 MacBook Pro (i5, 16 GB RAM).
OS is macOS Sonoma 14.3.
Tesseract is called from a terminal (Hyper in my case).
OCRmyPDF is called from Finder using macOS Shortcuts (configured to run exactly the same command as in the steps to reproduce).

Steps to reproduce

1. Run `tesseract 1.png out -l jpn_vert --psm 5 -c preserve_interword_spaces=1`
2. Run `ocrmypdf -l jpn_vert --tesseract-pagesegmode 5 --tesseract-config [path to a config file containing preserve_interword_spaces 1] --sidecar output.txt test.pdf output.pdf`
3. Open out.txt (the one made with Tesseract).
4. Open output.txt (made with OCRmyPDF).
5. Open output.pdf and copy some text.
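Steps 3 to 5 can also be checked mechanically rather than by eye. The sketch below compares two pieces of text character by character while ignoring all whitespace, so it reports only real OCR differences (such as the one differing kanji) and not spacing. The helper names `strip_ws` and `char_diffs` are hypothetical; only Python's standard `difflib` is used.

```python
import difflib


def strip_ws(text: str) -> str:
    """Drop every whitespace character, including newlines and U+3000."""
    return "".join(text.split())


def char_diffs(a: str, b: str) -> list[tuple[str, str, str]]:
    """Return (operation, part_of_a, part_of_b) for each differing region.

    Whitespace is removed from both inputs first, so an empty result
    means the two texts differ only in spacing or line breaks.
    """
    a, b = strip_ws(a), strip_ws(b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

For example, reading out.txt and output.txt (or text copied from output.pdf) into strings and passing them to `char_diffs` would show whether the PDF text differs from the Tesseract output in anything other than the added spaces.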

Files

`tesseract-config.cfg`

preserve_interword_spaces 1

Files to test:

test.pdf

Generated files:

out.txt > Tesseract output
output.txt > OCRmyPDF output
output.pdf

How did you download and install the software?

Homebrew

OCRmyPDF version

16.0.4

Relevant log output

No response

matsumurae added the bug label Feb 3, 2024

matsumurae assigned jbarlow83 Feb 3, 2024
