[Feature]: Integrations with other backends via hOcr (naive implementation of easyOcr backend inside) #1250

Open
coffepowered opened this issue Feb 10, 2024 · 4 comments

@coffepowered

coffepowered commented Feb 10, 2024

Describe the proposed feature

Hi, I see there are a few issues on the board proposing integrations of new backends.

I wondered how difficult this would be to do naively. It turns out it's doable: here's the result of a quick-and-dirty plugin I created in a couple of hours. I converted a non-readable sample PDF using OCRmyPDF with an EasyOCR backend:

image

I basically created hOCR output from the EasyOCR result object. However, I am not sure whether this is a suitable approach or whether it has fundamental limitations that would prevent this kind of integration from succeeding.

I expect any OCR engine to provide at least bounding boxes and text, which can then be expressed in hOCR format.
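To make that concrete, here is a minimal sketch, not the plugin's code. EasyOCR's `readtext` returns each detection as a four-point box plus text and confidence, while an hOCR `bbox` property holds an axis-aligned rectangle, so the box has to be collapsed. The helper name `quad_to_hocr_bbox` and the sample values are made up for illustration.

```python
import math
from typing import Sequence

def quad_to_hocr_bbox(quad: Sequence[Sequence[float]]) -> str:
    """Collapse a four-point box into an hOCR 'bbox x0 y0 x1 y1' property."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    # hOCR bboxes are integer pixel rectangles, so take the enclosing rectangle.
    x0, y0 = math.floor(min(xs)), math.floor(min(ys))
    x1, y1 = math.ceil(max(xs)), math.ceil(max(ys))
    return f"bbox {x0} {y0} {x1} {y1}"

# One detection in the (box, text, confidence) shape; the values are invented.
detection = ([[10, 20], [110, 22], [109, 48], [9, 46]], "Invoice", 0.93)
bbox_property = quad_to_hocr_bbox(detection[0])
print(bbox_property)
```

Note that taking the enclosing rectangle discards any rotation of the detected text, which may or may not matter for a given document.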

Are there any fundamental or semantic limitations I am unaware of that make reconstructing the hOCR format difficult?

@coffepowered
Author

Oh, never mind about the EasyOCR stuff; I see there's an official plugin (I'll look into that).

I would still be grateful if @jbarlow83 could comment on the issue in general: what are the difficulties and nuances of supporting multiple backends, besides the thin wrapping layer I so clumsily tried to reimplement myself?

@jbarlow83
Collaborator

I wrote easyocr as direct to PDF, and then significantly improved hocr to PDF so that the various OCR engines that return some flavor of JSON will have an easy path to convert to hocr. So hocr will become the one common backend for any OCR engine. hocr is also reasonably editable to support the use case of people who want to change the final OCR (e.g. manual spelling correction etc.) and the API supports this now.
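As an illustration of the "editable" point, here is a hedged sketch that fixes a misrecognized word directly in hOCR markup using only the Python standard library. It does not use the OCRmyPDF API mentioned above; the hOCR fragment and the helper `correct_words` are invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up hOCR fragment with one misrecognized word ("lnvoice").
HOCR = """<div class="ocr_page" title="bbox 0 0 600 800">
  <span class="ocr_line" title="bbox 10 20 300 50">
    <span class="ocrx_word" title="bbox 10 20 120 50; x_wconf 71">lnvoice</span>
    <span class="ocrx_word" title="bbox 130 20 300 50; x_wconf 95">number</span>
  </span>
</div>"""

def correct_words(hocr_xml: str, fixes: dict) -> str:
    """Replace the text of ocrx_word elements listed in `fixes`, keeping bboxes."""
    root = ET.fromstring(hocr_xml)
    for element in root.iter():
        if element.get("class") == "ocrx_word" and element.text in fixes:
            element.text = fixes[element.text]
    return ET.tostring(root, encoding="unicode")

corrected = correct_words(HOCR, {"lnvoice": "Invoice"})
print(corrected)
```

Only the word text changes; the `title` attributes carrying the bounding boxes stay untouched, so the corrected text still lands in the same place on the page.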

Currently right to left text (Arabic-Hebrew-Farsi) is a problem, and I'm not sure there's a glyphless font solution that will work for these languages on all PDF viewers (they have serious issues in Tesseract PDF too). I had pretty limited time for that effort and had to stop at that point.

But if RTL isn't a concern, then you just need to do a fairly simple JSON to hocr conversion at this stage.
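A rough sketch of what such a JSON to hocr conversion could look like. The input JSON layout (`text`, `bbox`, `confidence`) and the function `words_to_hocr` are assumptions for illustration, not the format of any particular engine or of OCRmyPDF. Each word is put on its own `ocr_line` to keep the sketch short; a real converter would likely group words into lines.

```python
import json
from html import escape

def words_to_hocr(words, page_width, page_height):
    """Render word dicts as a minimal hOCR page (one ocr_line per word)."""
    parts = [
        '<html xmlns="http://www.w3.org/1999/xhtml"><head>',
        '<meta name="ocr-capabilities" content="ocr_page ocr_line ocrx_word"/>',
        '</head><body>',
        f'<div class="ocr_page" title="bbox 0 0 {page_width} {page_height}">',
    ]
    for index, word in enumerate(words, start=1):
        x0, y0, x1, y1 = word["bbox"]
        confidence = round(word["confidence"] * 100)  # x_wconf is a percentage
        parts.append(
            f'<span class="ocr_line" title="bbox {x0} {y0} {x1} {y1}">'
            f'<span class="ocrx_word" id="word_{index}" '
            f'title="bbox {x0} {y0} {x1} {y1}; x_wconf {confidence}">'
            f'{escape(word["text"])}</span></span>'
        )
    parts += ['</div>', '</body></html>']
    return "\n".join(parts)

# Hypothetical engine output; escape() keeps the "&" from breaking the markup.
engine_json = '[{"text": "Total & tax", "bbox": [10, 20, 200, 50], "confidence": 0.93}]'
hocr = words_to_hocr(json.loads(engine_json), page_width=600, page_height=800)
print(hocr)
```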

@coffepowered
Author

Thanks for the response; I appreciate the help and orientation on this.

> I wrote easyocr as direct to PDF, and then significantly improved hocr to PDF so that the various OCR engines that return some flavor of JSON will have an easy path to convert to hocr. So hocr will become the one common backend for any OCR engine

So, as far as I understand, if you were to rewrite the EasyOCR plugin today, you'd rather use hocr instead. Is this accurate?
I am asking because I am thinking about a flexible way to support other open OCR models/engines: one super simple way would be to "convert" their results to be EasyOCR-like and then use your official plugin (which is way better engineered than my POC code).

> Currently right to left text (Arabic-Hebrew-Farsi) is a problem, and I'm not sure there's a glyphless font solution that will work for these languages on all PDF viewers (they have serious issues in Tesseract PDF too)

Great insight. Fortunately, RTL languages are definitely outside my use cases.

@jbarlow83
Collaborator

Yes, you definitely could convert some other OCR engine's JSON to EasyOCR JSON, and use the existing plugin.
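For illustration only, a sketch of that kind of adapter. It maps a hypothetical engine's output (dicts with `text`, `score`, and an `x`/`y`/`w`/`h` box, all assumed names) onto the `(four-point box, text, confidence)` tuples that EasyOCR's `readtext` returns. How the existing plugin actually ingests results is not covered in this thread, so wiring this in is left as an assumption.

```python
def to_easyocr_like(items):
    """Map {'text', 'box': {x, y, w, h}, 'score'} dicts to EasyOCR-style tuples."""
    results = []
    for item in items:
        box = item["box"]
        x0, y0 = box["x"], box["y"]
        x1, y1 = x0 + box["w"], y0 + box["h"]
        # Corner order: top-left, top-right, bottom-right, bottom-left.
        quad = [[x0, y0], [x1, y0], [x1, y1], [x0, y1]]
        results.append((quad, item["text"], float(item["score"])))
    return results

# Made-up output from some other engine.
other_engine_output = [
    {"text": "Hello", "score": 0.88, "box": {"x": 5, "y": 10, "w": 100, "h": 30}},
]
easyocr_like = to_easyocr_like(other_engine_output)
print(easyocr_like)
```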
