feat/ extract style or font for Text elements. #2695

Open
LunaticMaestro opened this issue Mar 26, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@LunaticMaestro

I was trying out the tutorial. However, when partitioning the PDF provided in the tutorial, I did not see the font style of the text stored in the element's metadata.

Is font-style extraction planned for a future release?

LunaticMaestro added the enhancement label Mar 26, 2024
@scanny
Collaborator

scanny commented Mar 26, 2024

@LunaticMaestro font style is stored in .metadata.emphasized_text_contents and .metadata.emphasized_text_tags. Did you look there?

@LunaticMaestro
Author

Hi scanny,
Thanks for the reply. Unfortunately, the suggested metadata fields do not contain the requested content.

Please find the screenshot attached.

I am using the PDF from the example docs: example-docs/layout-parser-paper.pdf

image

@scanny
Collaborator

scanny commented Mar 27, 2024

Hi @LunaticMaestro, yes, unfortunately it turns out that this metadata is not supported for PDF. Apologies for that.

It is supported for DOCX, however, if that's any help.

@LunaticMaestro
Author

I beg to differ. Here's an example snippet that reads a DOCX file and fails to pick up the font styling.

Please find the DOCX file attached for the purpose of reproducing.
redacted.docx

image

@scanny
Collaborator

scanny commented Mar 28, 2024

@LunaticMaestro the file you referenced has character styling set using a character style, which is unfortunately not yet supported.

However, text that is made bold or italic directly, using the toolbar buttons, is properly detected.

I added the following paragraph to the document:
"This is a paragraph that has some bold and some italic."
The words "bold" and "italic" were formatted with the toolbar buttons, and the paragraph produces the following metadata:

{
    'category_depth': 0,
    'emphasized_text_contents': ['bold', 'italic'],
    'emphasized_text_tags': ['b', 'i'],
    'last_modified': '2024-03-27T22:03:51',
    'languages': ['eng'],
    'parent_id': 'ede9865e755cdea84eb99e51cb277a0e',
    'file_directory': '/Users/scanny/Desktop',
    'filename': 'redacted.docx',
    'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
}
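
For anyone wanting to check their own file, the two lists in the metadata above line up index by index, so they can be zipped into (text, tag) pairs. This is only a minimal sketch: `pair_emphasis` and `emphasis_by_element` are hypothetical helper names, not part of unstructured, and the sketch assumes the metadata is read through `element.metadata.to_dict()` on elements returned by `partition_docx`.

```python
from typing import Any, Dict, List, Tuple


def pair_emphasis(metadata: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Pair each emphasized span with its tag, e.g. ("bold", "b").

    Hypothetical helper: works on a plain metadata dict like the one above.
    Missing or None fields are treated as "no emphasis detected".
    """
    contents = metadata.get("emphasized_text_contents") or []
    tags = metadata.get("emphasized_text_tags") or []
    return list(zip(contents, tags))


def emphasis_by_element(path: str) -> List[Tuple[str, List[Tuple[str, str]]]]:
    """Return (element text, emphasis pairs) for elements that carry emphasis."""
    # Imported lazily so pair_emphasis() is usable without unstructured installed.
    from unstructured.partition.docx import partition_docx

    results = []
    for element in partition_docx(filename=path):
        pairs = pair_emphasis(element.metadata.to_dict())
        if pairs:
            results.append((element.text, pairs))
    return results
```

If `emphasis_by_element("redacted.docx")` comes back empty for a paragraph that looks bold in Word, that would be consistent with the formatting coming from a character style rather than from direct formatting.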

@LunaticMaestro
Author

Since unstructured reuses pdfminer, I would expect it to use pdfminer's native facilities to get the character properties (example: pdfminer character style).
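
To illustrate what I mean, here is a rough sketch of reading per-character font information with pdfminer.six directly. It is not how unstructured works internally; `iter_pdf_chars` and `group_runs` are hypothetical names of my own, and the only pdfminer pieces assumed are `extract_pages` and the `fontname` and `size` attributes of `LTChar`.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

CharInfo = Tuple[str, str, float]  # (character, font name, font size)


def iter_pdf_chars(path: str) -> Iterator[CharInfo]:
    """Yield (character, font name, font size) for each glyph in the PDF."""
    # Imported lazily so group_runs() is usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(path):
        for block in page:
            if not isinstance(block, LTTextContainer):
                continue  # skip figures, lines, rectangles, etc.
            for line in block:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    # LTAnno objects (inferred spaces/newlines) carry no font
                    # information, so this sketch simply skips them.
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname, round(obj.size, 1)


def group_runs(chars: Iterable[CharInfo]) -> List[CharInfo]:
    """Merge consecutive characters that share the same font name and size."""
    runs = []
    for (font, size), group in groupby(chars, key=lambda c: (c[1], c[2])):
        runs.append(("".join(ch for ch, _, _ in group), font, size))
    return runs
```

Something like `group_runs(iter_pdf_chars("example-docs/layout-parser-paper.pdf"))` would then give runs of text with their font names. Whether a run is bold or italic would still have to be guessed from the font name (for example a name containing "Bold"), which is a heuristic and not guaranteed for every PDF.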
