Text Extraction Issue: Greek Language PDFs Rendered with Incorrect Alphabet #2939

Open
DarioBernardo opened this issue Apr 26, 2024 · 3 comments
Labels: bug (Something isn't working), ocr (Related to optical character recognition (OCR))

Comments

@DarioBernardo

Describe the bug
I am evaluating the UnstructuredClient for processing PDF documents and am encountering an issue with the Greek language text extraction. When I attempt to extract text from PDF documents in Greek, the output text appears in a non-Greek alphabet and is unreadable, making it impossible to use for my purposes.

To Reproduce
This is the code I am using; running it on any Greek document reproduces the error:

import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)

filename = "example_files/c_20230111133942393_2525540.pdf"
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy='hi_res',
    pdf_infer_table_structure=True,
    languages=["gr"],
)
try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)

Expected behavior
I expect the extracted text to accurately represent the original Greek characters from the PDF document.

Actual results
The extracted text contains characters that are not in the Greek alphabet, rendering the text unreadable. Here's a snippet of what I get:

{
    "type": "NarrativeText",
    "element_id": "aaad19db9a99367b392003c6db4a7e2b",
    "text": "\u00a3635 TO TGS TOV KTV sPSopmva E5 yihddav oySovia Svo ko Sixx entd exatootdv (176.082,17) EURO, nov avriotogi o8 g&ivia exatoppudpia (60.000.000) Spayy\u00e9e,",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "gr"
      ],
      "page_number": 1,
      "parent_id": "1b406d8798c823dcd1d195f4a4f331dd",
      "filename": "c_20230111133942393_2525540.pdf"
    }
  }
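A note on reading the output above: the `\uXXXX` sequences are ordinary JSON escaping and are not themselves the problem. The small sketch below uses only the standard `json` module and a fragment copied from the `text` field; the Greek sample word is my own illustration, not taken from the document. It shows that the escapes here decode to Latin-1 symbols rather than Greek letters, and that `ensure_ascii=False` keeps real Greek readable when printing.

```python
import json

# Fragment copied from the "text" field above, kept as raw JSON.
raw = r'"\u00a3635 TO TGS TOV KTV"'

# Decoding the escapes gives a pound sign, not a Greek letter.
text = json.loads(raw)
print(text)

# Illustrative Greek sample (an assumption, not from the document).
greek = {"text": "εκατό"}

# Default json.dumps escapes every non-ASCII character...
print(json.dumps(greek))
# ...while ensure_ascii=False prints the characters as they are.
print(json.dumps(greek, ensure_ascii=False))
```

So correctly extracted Greek would still show up as escapes with the default `json.dumps` settings, but those escapes would decode to Greek letters.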

Additional context

  • Using the latest version of the Unstructured SDK.
  • Issue occurs consistently with multiple documents in Greek.

Could this issue be due to a missing OCR plugin for the Greek language? Since I am utilizing the API, I would expect such components to be managed server-side.
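One thing that may be worth ruling out is the language code itself. If the OCR backend is Tesseract (an assumption at this point), its trained-data code for modern Greek is `ell` and for ancient Greek `grc`, and the ISO 639-1 code for Greek is `el`; `gr` is the country code for Greece rather than a language code. The helper below is a hypothetical local check, not part of the Unstructured SDK, and its tables only cover Greek. Whether the hosted API rejects, ignores, or falls back on an unknown code is not something I have confirmed.

```python
# Hypothetical helper: not part of unstructured_client.
# The tables are deliberately limited to Greek codes.
TESSERACT_GREEK = {"ell": "Greek (modern)", "grc": "Greek (ancient)"}
ISO_639_1_TO_TESSERACT = {"el": "ell"}


def normalize_greek_code(code):
    """Return a Tesseract-style Greek code, or None if the code is not recognized."""
    code = code.strip().lower()
    if code in TESSERACT_GREEK:
        return code
    return ISO_639_1_TO_TESSERACT.get(code)


for candidate in ["gr", "el", "ell", "grc"]:
    print(candidate, "->", normalize_greek_code(candidate))
```

If `gr` is silently treated as an unknown language, retrying the request with `languages=["ell"]` would be a cheap experiment before looking for a server-side cause.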

DarioBernardo added the bug label Apr 26, 2024
@christinestraub
Contributor

Hi @DarioBernardo, can you please share the PDF document (c_20230111133942393_2525540.pdf)?

@DarioBernardo
Author

Hi @christinestraub, thank you for looking into my issue. Unfortunately I can't share the document, but I am sure the issue can be reproduced with most Greek documents. One thing worth mentioning: the document is a scan of a paper document, so its pages are images rather than text.

@DarioBernardo
Author

I'd like to provide some additional context. I searched online for publicly available PDF documents that could help reproduce the problem, and I've confirmed that the issue arises when the API performs OCR on text contained in images. Specifically, when the PDF is a scan of a document, the OCR tool behind the API fails to recognize Greek characters and substitutes Latin (mostly ASCII) characters instead. However, if the text can be read directly from the PDF, the Greek characters are returned correctly (as Unicode escapes in the JSON output). This may be due to limitations in Tesseract, which I believe is the OCR tool behind the API.
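To tell the two cases apart automatically, a rough heuristic is to measure how many of the letters in an element's text fall in the Greek Unicode blocks. This is only a sketch: `greek_letter_ratio` is a name I made up, the Greek sample strings are my own, and the ranges used are the "Greek and Coptic" (U+0370 to U+03FF) and "Greek Extended" (U+1F00 to U+1FFF) blocks.

```python
def greek_letter_ratio(text):
    """Fraction of alphabetic characters that lie in the Greek Unicode blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(
        1
        for ch in letters
        if "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"
    )
    return greek / len(letters)


# Fragment of the OCR output reported in this issue: no Greek letters at all.
print(greek_letter_ratio("yihddav oySovia Svo ko Sixx"))

# Illustrative Greek text (my own sample): every letter is Greek.
print(greek_letter_ratio("δύο και δέκα"))
```

Elements from a Greek document whose ratio is close to zero are likely the ones that went through the failing OCR path, while text read directly from the PDF should score close to one.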

For instance, you can test this using the document available here. The document title, being part of an image, is not recognized correctly, whereas the rest of the document, which is text-based, is accurately processed.

scanny added the ocr label Apr 29, 2024