How can I extract titles, headers, photos, and the respective article information from a newspaper? #172

karndeepsingh · 2023-03-11T09:30:15Z

Hi,
I have been trying to use the Newspaper Navigator model for my application. It is able to detect regions such as a title or a whole article, but for my use case I want to extract each title together with the paragraphs below it. How can I do that? Please help me resolve this. Is there a tutorial available that covers it?

Thanks

nkoudounas · 2023-10-04T09:29:04Z

You are asking for a complete document layout task! This is not an issue, it's a task. Combine object detection (the bigger bboxes) with pdf_parser output (bboxes for every word or line). Filter the line/word output by the bigger boxes predicted by the vision model. You can then leverage spatial correlation (sort the boxes by their horizontal, then vertical, position) to identify words on the same line, or a heading above a paragraph (a heading will usually be a one-liner, identifiable as a bbox with a bigger area than the others, and the height of the heading will be smaller than the height of the paragraph). Hope that helps 👯
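A minimal sketch of the filtering and grouping idea above, in plain Python. Everything here is an assumption made for illustration: the `Box` type, the helper names, the tolerance and ratio values, and the simplified heading rule (a leading line whose word boxes are noticeably taller than the typical line in the region). It does not call the Newspaper Navigator model or any PDF parser; it only assumes you already have one region box from the detector and a list of word boxes in the same coordinate system.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Box:
    """Axis-aligned box; (x0, y0) is top-left, (x1, y1) is bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def contains(region: Box, word: Box) -> bool:
    """True if the word's center point falls inside the region box."""
    cx, cy = word.center()
    return region.x0 <= cx <= region.x1 and region.y0 <= cy <= region.y1


def group_into_lines(words, tol=0.5):
    """Group words into text lines by vertical center, then order each line left to right."""
    lines = []
    for w in sorted(words, key=lambda b: b.center()[1]):
        cy = w.center()[1]
        # Same line if the vertical centers are within tol * word height (assumed threshold).
        if lines and abs(cy - lines[-1][-1].center()[1]) <= tol * w.height:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [sorted(line, key=lambda b: b.x0) for line in lines]


def split_heading_and_body(region, words, ratio=1.3):
    """Keep words inside the region, then split leading tall lines (heading) from the rest (body)."""
    inside = [w for w in words if contains(region, w)]
    lines = group_into_lines(inside)
    if not lines:
        return "", ""
    heights = [median(w.height for w in line) for line in lines]
    typical = median(heights)
    heading, body = [], []
    for line, h in zip(lines, heights):
        text = " ".join(w.text for w in line)
        # Assumed heuristic: only lines before the first body line can be heading lines.
        if not body and h >= ratio * typical:
            heading.append(text)
        else:
            body.append(text)
    return " ".join(heading), " ".join(body)


if __name__ == "__main__":
    region = Box(0, 0, 200, 100)  # hypothetical article box from the detector
    words = [
        Box(10, 0, 60, 20, "Big"), Box(70, 0, 130, 20, "News"),
        Box(10, 30, 50, 40, "First"), Box(60, 30, 100, 40, "line"),
        Box(10, 45, 60, 55, "Second"), Box(70, 45, 110, 55, "line"),
        Box(300, 30, 340, 40, "Other"),  # lies outside the region, so it is dropped
    ]
    print(split_heading_and_body(region, words))
    # ('Big News', 'First line Second line')
```

Testing the word's center point rather than requiring full containment tolerates detector boxes that clip a word slightly. Real newspaper scans will likely need tuned thresholds and handling for multi-column regions.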
