Text Extraction Issue: Greek Language PDFs Rendered with Incorrect Alphabet #2939

DarioBernardo · 2024-04-26T11:35:34Z

Describe the bug
I am evaluating UnstructuredClient for processing PDF documents and have run into a problem with Greek text extraction. When I extract text from PDF documents written in Greek, the output appears in a non-Greek alphabet and is unreadable, so I cannot use it.

To Reproduce
This is the code I am using. Running it on any Greek document reproduces the error:

import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# DLAI_API_KEY and DLAI_API_URL must be defined before this point.
s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)

# Read the whole PDF into memory and wrap it for upload.
filename = "example_files/c_20230111133942393_2525540.pdf"
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy='hi_res',
    pdf_infer_table_structure=True,
    languages=["gr"],
)
try:
    resp = s.general.partition(req)
    # Print the first three extracted elements.
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)

Expected behavior
I expect the extracted text to accurately represent the original Greek characters from the PDF document.

Actual results
The extracted text contains characters that are not in the Greek alphabet, rendering the text unreadable. Here's a snippet of what I get:

{
    "type": "NarrativeText",
    "element_id": "aaad19db9a99367b392003c6db4a7e2b",
    "text": "\u00a3635 TO TGS TOV KTV sPSopmva E5 yihddav oySovia Svo ko Sixx entd exatootdv (176.082,17) EURO, nov avriotogi o8 g&ivia exatoppudpia (60.000.000) Spayy\u00e9e,",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "gr"
      ],
      "page_number": 1,
      "parent_id": "1b406d8798c823dcd1d195f4a4f331dd",
      "filename": "c_20230111133942393_2525540.pdf"
    }
  }

Additional context

Using the latest version of the Unstructured SDK.
Issue occurs consistently with multiple documents in Greek.

Could this issue be due to a missing OCR plugin for the Greek language? Since I am utilizing the API, I would expect such components to be managed server-side.
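One thing that may be worth ruling out alongside a missing language pack: "gr" is the ISO 3166 country code for Greece, not a language code. Greek is "el" in ISO 639-1, and Tesseract names its Modern Greek model "ell". Whether the API rejects, ignores, or remaps "gr" is not confirmed anywhere in this thread, so the following is only a sketch of how a caller might normalize the value before building the request. The helper name `normalize_language` and the alias set are hypothetical and not part of unstructured_client.

```python
# Hypothetical helper, not part of unstructured_client.
# Assumption: the server-side OCR expects Tesseract-style codes such as "ell".
GREEK_ALIASES = {"gr", "el", "gre", "greek"}


def normalize_language(code: str) -> str:
    """Map common aliases for Greek to "ell"; pass other codes through lowercased."""
    cleaned = code.strip().lower()
    return "ell" if cleaned in GREEK_ALIASES else cleaned


# Example: the value used in the report above would become ["ell"].
languages = [normalize_language(code) for code in ["gr"]]
print(languages)
```

If the output is still Latin-looking text with languages=["ell"], that would point more strongly at the Greek OCR model not being available server-side.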


christinestraub · 2024-04-26T16:23:49Z

Hi @DarioBernardo, can you please share the PDF document (c_20230111133942393_2525540.pdf)?

DarioBernardo · 2024-04-26T16:34:26Z

Hi @christinestraub, thank you for looking into my issue. Unfortunately I can't share the document, but I am sure the issue is reproducible with most Greek documents. One thing worth mentioning is that the document is a scan of a paper document, so its pages are images.

DarioBernardo · 2024-04-29T10:39:32Z

I'd like to provide some additional context regarding the issue. I searched online for publicly available PDF documents that could help replicate the problem, and I've confirmed that the issue arises when the API performs OCR on characters from images in PDFs. Specifically, when the PDF is a scan of a document, the OCR tool behind the API fails to recognize Greek characters and substitutes ASCII characters for them. However, if the content can be read directly from the PDF's text layer, the Greek characters are returned correctly (as non-ASCII Unicode escapes in the JSON). This may be due to limitations in Tesseract, which I believe is the OCR tool behind the API.

For instance, you can test this using the document available here. The document title, being part of an image, is not recognized correctly, whereas the rest of the document, which is text-based, is accurately processed.
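To tell the two cases apart programmatically, a rough check is to measure what share of the letters in each element's text belong to the Greek script. The sketch below uses only the Python standard library. The function name `greek_letter_ratio` and the idea of comparing it against a threshold are my own assumptions for illustration, not something the API provides.

```python
import unicodedata


def greek_letter_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters whose Unicode name contains GREEK."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(1 for ch in letters if "GREEK" in unicodedata.name(ch, ""))
    return greek / len(letters)


# Text read from a PDF text layer should score close to 1.0,
# while OCR output like the snippet in the report scores 0.0.
print(greek_letter_ratio("Ελληνικά"))
print(greek_letter_ratio("TO TGS TOV KTV"))
```

Elements from a document declared as Greek that score near zero are likely the ones that went through the failing OCR path.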

DarioBernardo added the "bug" label (Something isn't working) on Apr 26, 2024

scanny added the "ocr" label (Related to optical character recognition (OCR).) on Apr 29, 2024
