Include all nodes with text #885

jecarr · 2021-05-10T05:55:59Z

Closes #363

To tackle missing article paragraphs, this change treats any node that contains text as a candidate for inclusion in the final text attribute of an article instance.
The test cases pass, with a warning where extra text has been found (i.e. the equal-text asserts fail) but the main article text is still contained within the parsed article text.
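
For readers who want a concrete picture of the idea, here is a minimal sketch. It is not the code in this PR: it uses the standard library's `html.parser` as a stand-in for newspaper's own parser, and the names `TextNodeCollector`, `all_text` and `check_article_text` are hypothetical. It gathers the text of every node that has text, then applies the lenient test check described above (exact match passes, containment passes with a warning).

```python
from html.parser import HTMLParser
import warnings


class TextNodeCollector(HTMLParser):
    """Hypothetical helper: collect every non-empty text node in page order."""

    SKIP = {"script", "style"}  # assumption: these never hold article text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalise whitespace; inline children (e.g. <b>) become separate chunks.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.chunks.append(text)


def all_text(html):
    """Join the text of all text-bearing nodes, one chunk per paragraph."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.chunks)


def check_article_text(expected, parsed):
    """Pass on an exact match; warn (but still pass) if extra text surrounds it."""
    if parsed == expected:
        return True
    if expected in parsed:
        warnings.warn("extra text found around the expected article text")
        return True
    return False
```

The trade-off is the one the tests surface: including every text node recovers dropped paragraphs, but it can also pull in text that is not part of the article, which is why the check only warns instead of failing.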

Edit: I found more missing text when using this url. This is because there are <li> elements that are not being gathered. Also, the <table> at the bottom of the page didn't translate to text well. As these are further fixes (that may break how this PR fixes other urls), my fixes for them are in jecarr#1.

shawei3000 · 2021-05-12T04:52:15Z

I followed the above steps and updated the newspaper/* files in my specific Anaconda env, but I still see significant missing paragraphs for this url (https://www.stltoday.com/news/local/crime-and-courts/belleville-man-gets-20-years-for-ponzi-scheme/article_194000a6-1a13-5841-b53a-44305142bd23.html). Maybe this is a different scenario?

jecarr · 2021-05-13T03:10:45Z

Hey @shawei3000 - thanks for the feedback. You are right, it was a new test case for me. I was used to seeing the missing text come after the text newspaper chose. For that article, the missing text was the first half of the article (not the latter half), so my first fix produced text with the article's order mixed up.

I've updated the PR, but the html attribute still needs updating. The text attribute should now have most of that URL's article text in order (the first sentence doesn't appear to be picked up; I'll look into that too).
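
To illustrate the ordering problem, here is a small sketch, not the PR's implementation. The function name `merge_in_document_order` and the sample data are hypothetical, and it assumes each paragraph's text is unique on the page. Appending the missing text after the chosen text breaks the order when the missing part comes first; walking the page's text nodes in document order keeps it intact.

```python
def merge_in_document_order(page_texts, chosen, missing):
    """Return chosen + missing paragraphs in the order they appear on the page.

    page_texts: text of every text-bearing node, in document order.
    chosen:     paragraphs the extractor already selected.
    missing:    paragraphs recovered afterwards.
    """
    wanted = set(chosen) | set(missing)
    seen = set()
    merged = []
    for text in page_texts:
        # Keep only article paragraphs, once each, in page order.
        if text in wanted and text not in seen:
            merged.append(text)
            seen.add(text)
    return merged


page = ["Nav", "Intro", "Middle", "End", "Footer"]
chosen = ["Middle", "End"]   # what the extractor picked
missing = ["Intro"]          # the first half that was dropped

naive = chosen + missing     # order is wrong: Intro lands last
ordered = merge_in_document_order(page, chosen, missing)
```

Here `naive` ends with "Intro", while `ordered` starts with it, which matches the article as published.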

Thanks again for the heads up! @codelucas, feel free to highlight if my approach needs refining.

jecarr added 4 commits May 7, 2021 18:20

Included nodes with text in article.parse()

1764420

Included nodes with text in html of article.parse()

7887cd3

Fixed typo from 7887cd3

c8e1e84

Updated unit tests

962ca5e

jecarr mentioned this pull request May 10, 2021

Not working on New York Times #729

Open

jecarr added 2 commits May 13, 2021 14:48

fixed order of missing text being added to article

7f2f436

Fixed failing test after 7f2f436

223c928

jecarr added 2 commits May 14, 2021 11:39

Update article html by inserting missing text in order

b5d0fc6

Improved regex code in insert_missing_html()

c3f3076

jecarr mentioned this pull request May 20, 2021

Issue with running tram.py mitre-attack/tram#60

Closed

Replaced regex searches with xpath in insert_missing_html()

1fc9cc6

jecarr mentioned this pull request Jun 2, 2021

Include missing report text mitre-attack/tram#83

Closed

AndyTheFactory mentioned this pull request Oct 24, 2023

Include all nodes with text AndyTheFactory/newspaper4k#515

Closed
