feat/ extract style or font for Text elements. #2695

LunaticMaestro · 2024-03-26T06:24:55Z

I was trying out the tutorial. However, when partitioning the PDF provided in the tutorial, I did not see the font style of the text stored in the element's metadata.

Is font-style extraction planned for the future?

scanny · 2024-03-26T17:35:26Z

@LunaticMaestro font style is stored in .metadata.emphasized_text_contents and .metadata.emphasized_text_tags. Did you look there?
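For anyone checking this on their own documents, here is a minimal sketch of how those two fields could be read side by side. The helper name `emphasis_pairs` and the stand-in element are hypothetical; only the two metadata field names come from the comment above.

```python
from types import SimpleNamespace


def emphasis_pairs(element):
    """Return (text, tag) pairs for one element, or [] if none are recorded."""
    meta = getattr(element, "metadata", None)
    # Both fields may be missing or None when no emphasis was detected.
    contents = getattr(meta, "emphasized_text_contents", None) or []
    tags = getattr(meta, "emphasized_text_tags", None) or []
    return list(zip(contents, tags))


# Hypothetical stand-in for an element returned by a partitioner.
demo = SimpleNamespace(
    text="Some bold and some italic words.",
    metadata=SimpleNamespace(
        emphasized_text_contents=["bold", "italic"],
        emphasized_text_tags=["b", "i"],
    ),
)

print(emphasis_pairs(demo))
```

With unstructured installed, the same helper should work on real elements, e.g. `for el in partition_docx(filename="my.docx"): print(emphasis_pairs(el))`, where `partition_docx` is imported from `unstructured.partition.docx`. An empty list for every element means no emphasis was recorded for that file.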

LunaticMaestro · 2024-03-27T03:10:18Z

Hi scanny,
Thanks for the reply. Unfortunately, the suggested metadata fields do not contain the requested content.

Find the screenshot attached.

I am using the PDF from the example docs: example-docs/layout-parser-paper.pdf

scanny · 2024-03-27T22:25:38Z

Hi @LunaticMaestro, yes, unfortunately it turns out that this metadata is not supported for PDF. Apologies for that.

It is supported for DOCX, however, if that helps.

LunaticMaestro · 2024-03-28T04:11:55Z

I beg to differ. Here's an example snippet that reads a DOCX file and fails to detect the font styling.

The DOCX file is attached for the purpose of reproducing the issue:
redacted.docx

scanny · 2024-03-28T05:10:19Z

@LunaticMaestro the file you referenced has character styling set using a character style, which is unfortunately not yet supported.

However, text that is made bold or italic directly, using the toolbar buttons, is properly detected.

I added the following paragraph to the document:
"This is a paragraph that has some bold and some italic.", with the words "bold" and "italic" formatted with the toolbar buttons and it produces the following metadata:

{
    'category_depth': 0,
    'emphasized_text_contents': ['bold', 'italic'],
    'emphasized_text_tags': ['b', 'i'],
    'last_modified': '2024-03-27T22:03:51',
    'languages': ['eng'],
    'parent_id': 'ede9865e755cdea84eb99e51cb277a0e',
    'file_directory': '/Users/scanny/Desktop',
    'filename': 'redacted.docx',
    'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
}
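To illustrate the distinction drawn above between direct formatting and a character style, here is a rough sketch that inspects a hand-written WordprocessingML fragment with the standard library. It is not how unstructured detects emphasis; the fragment, the `Strong` style id, and the `describe_runs` helper are made up for illustration. In the DOCX XML, direct bold appears as a `w:b` element in the run properties, while a character style appears as a `w:rStyle` reference whose bold setting lives in the style definition elsewhere.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical paragraph: one run bolded directly, one run using a character style.
FRAGMENT = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Some </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
  <w:r><w:t xml:space="preserve"> and </w:t></w:r>
  <w:r><w:rPr><w:rStyle w:val="Strong"/></w:rPr><w:t>styled</w:t></w:r>
</w:p>
"""


def describe_runs(paragraph_xml):
    """Return (text, direct_bold, character_style_id) for each run."""
    paragraph = ET.fromstring(paragraph_xml)
    described = []
    for run in paragraph.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        # <w:b/> means bold on; an explicit val of 0/false/off turns it off.
        b = run.find("w:rPr/w:b", NS)
        direct_bold = b is not None and b.get(f"{{{W}}}val", "true") not in (
            "0", "false", "off")
        # A character style is only a reference; its formatting is defined in styles.xml.
        style = run.find("w:rPr/w:rStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        described.append((text, direct_bold, style_id))
    return described


for row in describe_runs(FRAGMENT):
    print(row)
```

The run that uses the character style reports `direct_bold` as `False`, which is consistent with emphasis going undetected when only direct formatting is examined.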

LunaticMaestro · 2024-03-28T05:54:32Z

Since unstructured reuses pdfminer, I was expecting it to use pdfminer's native capabilities to get the character properties, for example: pdf miner character style.
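As a rough sketch of what is being asked for, the following reads per-character font names and sizes with pdfminer.six directly, outside of unstructured. The helper names and the font-name heuristics are assumptions for illustration: PDF fonts carry no standard bold/italic flag in their names, so matching on "Bold", "Italic", or "Oblique" will miss fonts that use other naming schemes.

```python
def base_font_name(fontname):
    """Drop a subset prefix such as 'ABCDEF+' from a PDF font name."""
    return fontname.split("+", 1)[-1]


def looks_bold(fontname):
    # Heuristic only: relies on the font name containing 'bold'.
    return "bold" in base_font_name(fontname).lower()


def looks_italic(fontname):
    # Heuristic only: relies on the font name containing 'italic' or 'oblique'.
    name = base_font_name(fontname).lower()
    return "italic" in name or "oblique" in name


def iter_char_fonts(pdf_path):
    """Yield (char, fontname, size) for each character; requires pdfminer.six."""
    # Imported lazily so the helpers above work without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for box in page:
            if not isinstance(box, LTTextContainer):
                continue
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname, obj.size
```

For example, `{font for _, font, _ in iter_char_fonts("example-docs/layout-parser-paper.pdf")}` would list the distinct font names in the tutorial PDF, which could then be passed through `looks_bold` and `looks_italic`.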

LunaticMaestro added the enhancement New feature or request label Mar 26, 2024
