Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include all nodes with text #885

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

Conversation

jecarr
Copy link

@jecarr jecarr commented May 10, 2021

Closes #363

  • To tackle missing article paragraphs, this suggestion considers any node with text to be included in the final text attribute of an article instance
  • Test cases pass with a warning where extra text has been found (i.e. equal-text asserts fail) but main article text has been found within parsed article text

Edit - found more missing text when using this url. This is because there are < li >s not being gathered. Plus the < table > at the bottom of the page didn't translate to text well. As these are further fixes (that may break how this PR fixes for other urls), my fixes for this are in jecarr#1

@shawei3000
Copy link

i followed the above steps, and updated newspaper/* files in my specific anaconda env, and still experience significant missing paragraphs for this url, ( https://www.stltoday.com/news/local/crime-and-courts/belleville-man-gets-20-years-for-ponzi-scheme/article_194000a6-1a13-5841-b53a-44305142bd23.html ), maybe this is different scenario?

@jecarr
Copy link
Author

jecarr commented May 13, 2021

Hey @shawei3000 - thanks for the feedback. You are right, it was a new test case for me. I was used to seeing the missing text after the text newspaper chose. That article gave me the first half of the article being the missing text (not the latter). So my first fix produced text where the article's order messed up.

I've updated the PR but the html attribute needs updating. The text attribute should have most of that URL's article text in order (the first sentence appears to not be picking up, I'll look into that too).

Thanks again for the heads up! @codelucas, feel free to highlight if my approach needs refining.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Not woking on "nytimes.com"
2 participants