[Bug]: OCR on .pdf isn't the same as tesseract but the format is correct on .txt file #1244

Open
matsumurae opened this issue Feb 3, 2024 · 0 comments

Describe the bug

Reason for this issue

I've been trying to turn a lot of Japanese novels I have at home into searchable PDFs, which would make it easier to look up unknown kanji. However, generating a searchable PDF didn't go as expected.

The problem

I've tried different formats, but I'm unsure why the text is generated correctly as .txt but not as .pdf. I've attached both results: the tesseract .txt and the OCRmyPDF .txt, plus the .pdf generated by OCRmyPDF.

As you can see, both .txt files are almost identical (I noticed one differing kanji, but everything else looked the same). That isn't the case with the .pdf. When I copy text from it, extra spaces are added:
そのまま 男 の 両足 がふわりと 浮き上がり、 彼 の中で、 世 界がぐるりと回 転した。

This doesn't happen with Apple's OCR on images, which gives the same result as the .txt file.
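To check whether the text copied from the PDF differs from the .txt output only by these inserted spaces, something like the following Python sketch could be used. The function names and the chosen Unicode ranges are my own assumptions for illustration; they are not part of OCRmyPDF or Tesseract.

```python
import re

# Assumed character ranges: CJK punctuation, hiragana, katakana,
# common CJK ideographs, and full-width forms.
_CJK = r"\u3000-\u303f\u3040-\u30ff\u4e00-\u9fff\uff00-\uffef"

# A run of spaces or tabs with a Japanese character on both sides.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove spaces and tabs that sit between two Japanese characters."""
    return _SPACE_BETWEEN_CJK.sub("", text)


def differs_only_by_spaces(pdf_text: str, txt_text: str) -> bool:
    """True if both strings are equal once spaces between CJK characters are removed."""
    return strip_cjk_spaces(pdf_text) == strip_cjk_spaces(txt_text)


# The line copied from output.pdf, as quoted in this issue.
copied = "そのまま 男 の 両足 がふわりと 浮き上がり、 彼 の中で、 世 界がぐるりと回 転した。"
print(strip_cjk_spaces(copied))
```

If `differs_only_by_spaces` returns True for the copied PDF text and the contents of output.txt, the recognized characters are the same and only the spacing in the PDF text layer differs.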

Notes about my computer

  • I'm using a 2020 MacBook Pro (i5, 16 GB RAM).
  • OS is macOS Sonoma 14.3.
  • Tesseract is called from the terminal (Hyper in my case).
  • OCRmyPDF is called from Finder using macOS Shortcuts (I've configured exactly the same command as in the steps below).

Steps to reproduce

1. Run tesseract 1.png out -l jpn_vert --psm 5 -c preserve_interword_spaces=1
2. Run ocrmypdf -l jpn_vert --tesseract-pagesegmode 5 --tesseract-config [path to the config file containing preserve_interword_spaces 1] --sidecar output.txt test.pdf output.pdf
3. Open out.txt (the one made with tesseract).
4. Then open output.txt (made with ocrmypdf).
5. Open output.pdf and copy some text.
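Steps 1 to 4 could also be scripted so the two .txt outputs are compared automatically. This is only a sketch: the command lines are the ones from the steps above, while the helper names (`tesseract_cmd`, `ocrmypdf_cmd`, `diff_texts`, `reproduce`) are hypothetical and it assumes both tools are on the PATH and that the files sit in the current directory.

```python
import difflib
import subprocess
from pathlib import Path


def tesseract_cmd(image: str, out_base: str) -> list[str]:
    """Command from step 1; tesseract writes <out_base>.txt."""
    return ["tesseract", image, out_base, "-l", "jpn_vert",
            "--psm", "5", "-c", "preserve_interword_spaces=1"]


def ocrmypdf_cmd(config: str, sidecar: str, src_pdf: str, dst_pdf: str) -> list[str]:
    """Command from step 2; the recognized text goes to the sidecar file."""
    return ["ocrmypdf", "-l", "jpn_vert",
            "--tesseract-pagesegmode", "5",
            "--tesseract-config", config,
            "--sidecar", sidecar, src_pdf, dst_pdf]


def diff_texts(a: str, b: str, a_name: str = "out.txt",
               b_name: str = "output.txt") -> list[str]:
    """Unified diff of two texts; an empty list means they are identical."""
    return list(difflib.unified_diff(a.splitlines(), b.splitlines(),
                                     a_name, b_name, lineterm=""))


def reproduce() -> None:
    """Run both tools, then print the differences between the two .txt files."""
    subprocess.run(tesseract_cmd("1.png", "out"), check=True)
    subprocess.run(ocrmypdf_cmd("tesseract-config.cfg", "output.txt",
                                "test.pdf", "output.pdf"), check=True)
    a = Path("out.txt").read_text(encoding="utf-8")
    b = Path("output.txt").read_text(encoding="utf-8")
    print("\n".join(diff_texts(a, b)) or "no differences")
```

Calling `reproduce()` runs both commands and prints the diff. Step 5 (copying text out of output.pdf) is still done by hand here.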

Files

tesseract-config.cfg

preserve_interword_spaces 1

Files to test:

1.png
test.pdf

Generated files:

out.txt > Tesseract output
output.txt > OCRmypdf output
output.pdf

How did you download and install the software?

Homebrew

OCRmyPDF version

16.0.4

Relevant log output

No response
