How can I extract titles, headers, photos, and the respective article information from a newspaper? #172

Open
karndeepsingh opened this issue Mar 11, 2023 · 1 comment

Comments

@karndeepsingh

Hi,
I have been trying to use the Newspaper Navigator model in my application. It detects regions such as a title or a whole article, but for my use case I want to extract each title together with the paragraphs below it. How can I do that? Is there a tutorial available that covers this?

Thanks

@nkoudounas

nkoudounas commented Oct 4, 2023

You are asking for a complete document layout task! This is not an issue, it's a task. Combine object detection (the bigger bboxes) with pdf_parser output (bboxes for every word or line). Filter the line/word output by the bigger boxes predicted by the vision model. You can then use spatial correlation (sort the boxes by their position along the width, then the height) to identify words on the same line, or a heading above a paragraph: the heading will usually be a one-liner whose bbox is bigger than the other line bboxes, and the heading block will be shorter than the paragraph block below it. Hope that helps 👯
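
A minimal sketch of the filtering and grouping steps described above, in plain Python. Everything here is an assumption for illustration: the `Box` type, the function names, the tolerance and ratio values, and the idea that both the detector regions and the word boxes are already available in the same pixel coordinate system. It does not call Newspaper Navigator or any PDF parser, and the heading rule (first line noticeably taller than the rest) is just one simple reading of the heuristic, not a tested method.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Box:
    """Axis-aligned box (hypothetical type); y grows downward as in image coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    @property
    def center(self) -> tuple:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def words_in_region(region: Box, words: list) -> list:
    """Keep the word boxes whose center falls inside the detected region."""
    kept = []
    for w in words:
        cx, cy = w.center
        if region.x0 <= cx <= region.x1 and region.y0 <= cy <= region.y1:
            kept.append(w)
    return kept


def group_into_lines(words: list, tol: float = 0.5) -> list:
    """Group words into text lines by vertical center, then order each line left to right."""
    lines = []
    for w in sorted(words, key=lambda b: (b.center[1], b.x0)):
        # Same line if the vertical center is close to the first word of the current line.
        if lines and abs(w.center[1] - lines[-1][0].center[1]) <= tol * w.height:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [sorted(line, key=lambda b: b.x0) for line in lines]


def split_heading(lines: list, ratio: float = 1.3):
    """Treat the first line as a heading if it is clearly taller than the remaining lines."""
    texts = [" ".join(w.text for w in line) for line in lines]
    if len(lines) < 2:
        return None, texts
    first_height = max(w.height for w in lines[0])
    body_height = median(max(w.height for w in line) for line in lines[1:])
    if first_height >= ratio * body_height:
        return texts[0], texts[1:]
    return None, texts


if __name__ == "__main__":
    # Made-up boxes: one detected article region and a few word boxes.
    region = Box(0, 0, 200, 100)
    words = [
        Box(10, 10, 60, 30, "Big"), Box(65, 10, 140, 30, "Headline"),
        Box(10, 40, 40, 50, "body"), Box(45, 40, 80, 50, "text"),
        Box(300, 10, 340, 20, "elsewhere"),
    ]
    lines = group_into_lines(words_in_region(region, words))
    print(split_heading(lines))
```

In practice the thresholds would need tuning per scan resolution, and multi-column layouts would need the region boxes to separate columns before the lines are grouped.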
