feat/ extract style or font for Text elements. #2695
Comments
@LunaticMaestro font style is stored in |
Hi @LunaticMaestro, yes, unfortunately it turns out that this metadata is not supported for PDF, apologies for that. It is supported for DOCX, however, if that's a help.
I beg to differ. Here's an example snippet that reads a DOCX file and fails to detect the font styling. The DOCX file is attached for the purpose of reproducing the problem.
@LunaticMaestro the file you referenced has its character styling set using a character style, which is unfortunately not yet supported. However, text that is made bold or italic directly, using the toolbar buttons, is properly detected. I added the following paragraph to the document:
{
    'category_depth': 0,
    'emphasized_text_contents': ['bold', 'italic'],
    'emphasized_text_tags': ['b', 'i'],
    'last_modified': '2024-03-27T22:03:51',
    'languages': ['eng'],
    'parent_id': 'ede9865e755cdea84eb99e51cb277a0e',
    'file_directory': '/Users/scanny/Desktop',
    'filename': 'redacted.docx',
    'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
}
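For anyone reading the metadata above, here is a minimal sketch of how the two emphasis fields could be consumed once you have an element's metadata as a plain dict. The helper name `emphasized_runs` is hypothetical and not part of unstructured, and the sketch assumes the two lists are index-aligned (the first tag describes the first text fragment), which is what the output above suggests.

```python
from typing import Any


def emphasized_runs(metadata: dict[str, Any]) -> list[tuple[str, str]]:
    """Pair each emphasized text fragment with its tag ('b' or 'i').

    Hypothetical helper: assumes 'emphasized_text_contents' and
    'emphasized_text_tags' are parallel lists, and treats a missing
    or None field as "no emphasis detected".
    """
    contents = metadata.get("emphasized_text_contents") or []
    tags = metadata.get("emphasized_text_tags") or []
    return list(zip(contents, tags))


# Metadata fields copied from the output shown in this thread.
sample_metadata = {
    "category_depth": 0,
    "emphasized_text_contents": ["bold", "italic"],
    "emphasized_text_tags": ["b", "i"],
    "languages": ["eng"],
    "filename": "redacted.docx",
}

runs = emphasized_runs(sample_metadata)
bold_fragments = [text for text, tag in runs if tag == "b"]

print(runs)
print(bold_fragments)
```

An element with no directly applied bold or italic simply yields an empty list here, so the same call works for PDF elements where these fields are absent.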
Since unstructured re-uses |
I was trying out the tutorial. However, when partitioning the PDF provided in the tutorial, I did not see the font style of the text stored in the element's metadata.
Is font-style extraction planned for the future?