Not working on "nytimes.com" #363

Open
Praveena0989 opened this issue May 4, 2017 · 12 comments · May be fixed by #885

Comments

@Praveena0989

I tried a few articles from NYTimes.com, but it parses only the second half of each article and misses the first half.
Example urls:
url 1
url2

@dlundergreen

Did you check the website to make sure that you haven't reached the max free articles that you are allowed to see for the month?

@Praveena0989
Author

Praveena0989 commented May 24, 2017

@dlundergreen I don't remember the last time I opened NYTimes, so I am sure I have not crossed the limit.

@sskadamb

sskadamb commented Jul 5, 2017

This also happens for other links. For example, on this URL only part of the body is parsed. Is this because the individual <p> elements are in different parent <div>s?

@Cabu

Cabu commented Sep 5, 2017

NYTimes articles are split over two DIVs, and the second one is generally bigger, so newspaper picks it.
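To illustrate the point above, here is a minimal sketch of how choosing the single largest container drops part of an article. It does not use newspaper's code: the markup, the `text_length` helper, and the "largest text wins" rule are all simplifying assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article markup: the body is split across two <div>s.
HTML = """
<article>
  <div id="part1"><p>First half, short intro.</p></div>
  <div id="part2"><p>Second half, which is noticeably longer.</p><p>More text here.</p></div>
</article>
""".strip()


def text_length(node):
    """Total number of characters in the direct <p> children of a node."""
    return sum(len("".join(p.itertext())) for p in node.findall("p"))


root = ET.fromstring(HTML)

# Pick the one <div> with the most paragraph text (assumed selection rule).
best = max(root.iter("div"), key=text_length)
extracted = " ".join("".join(p.itertext()) for p in best.findall("p"))

print(best.get("id"))  # part2
print(extracted)       # the text of part1 is missing
```

Under these assumptions, the shorter first DIV never makes it into the extracted text, which matches the "missing first half" symptom reported here.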

@ghost

ghost commented Nov 5, 2018

Was anyone able to solve this?

@Cabu

Cabu commented Nov 5, 2018

I found that changing PARENT_DECAY to 1.0 makes it work for NYT.

@ghost

ghost commented Nov 6, 2018

@Cabu I couldn't find a variable named PARENT_DECAY on the master branch, so where is it located?

@Cabu

Cabu commented Nov 6, 2018

@loaighoraba
paper = newspaper.build(source_url, PARENT_DECAY=1.0)

@ghost

ghost commented Nov 6, 2018

@Cabu It seems this has changed in the master branch; there is no such variable.

@Cabu

Cabu commented Nov 6, 2018

@loaighoraba
Oh yes, I see. It now seems to be hardcoded in extractor.py, line 825 :/
Having it as a 'hidden' feature was practical for sources like the NYT.
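For readers wondering why a decay of 1.0 would help, here is a toy model of score propagation. It is only a sketch under stated assumptions: each paragraph's score is added in full to its parent and scaled by a decay factor for its grandparent. The function `best_container`, the node names, and the scores are hypothetical, and the actual logic in newspaper may differ.

```python
from collections import defaultdict


def best_container(paragraphs, parent_of, parent_decay):
    """Return the container with the highest accumulated score.

    paragraphs:   mapping of paragraph name -> score
    parent_of:    mapping of node name -> parent name
    parent_decay: factor applied to the score passed to the grandparent
    """
    scores = defaultdict(float)
    for para, score in paragraphs.items():
        parent = parent_of[para]
        scores[parent] += score  # the parent receives the full score
        grandparent = parent_of.get(parent)
        if grandparent is not None:
            # the grandparent receives a decayed share (assumed behaviour)
            scores[grandparent] += score * parent_decay
    return max(scores, key=scores.get)


# Hypothetical article: two paragraphs in div1, three in div2.
parent_of = {
    "p1": "div1", "p2": "div1",
    "p3": "div2", "p4": "div2", "p5": "div2",
    "div1": "article", "div2": "article",
}
paragraphs = {"p1": 20, "p2": 20, "p3": 20, "p4": 20, "p5": 20}

print(best_container(paragraphs, parent_of, 0.5))  # div2
print(best_container(paragraphs, parent_of, 1.0))  # article
```

With a decay of 0.5 the shared `article` node scores 50 and loses to `div2` (60). With 1.0 it scores 100 and wins, so both halves are kept.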

@ghost

ghost commented Nov 6, 2018

@Cabu I see. However, this won't solve the issue if the common parent is more than two levels up. Thanks for this anyway.
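One possible way to handle a common parent at any depth is to look for the lowest common ancestor of the text-bearing nodes, rather than propagating scores a fixed number of levels. The sketch below is not taken from newspaper or from the linked PR: `ancestors`, `lowest_common_ancestor`, and the `parent_of` tree are hypothetical names used only for illustration.

```python
def ancestors(node, parent_of):
    """Return the chain [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def lowest_common_ancestor(nodes, parent_of):
    """Return the deepest node that lies on the ancestor chain of every node."""
    chains = [ancestors(n, parent_of) for n in nodes]
    common = set(chains[0]).intersection(*chains[1:])
    # Walking up from the first node, the first shared entry is the deepest one.
    return next(n for n in chains[0] if n in common)


# Hypothetical tree: the shared parent is three levels above each paragraph.
parent_of = {
    "p1": "div1", "div1": "section1", "section1": "article",
    "p2": "div2", "div2": "section2", "section2": "article",
    "article": "body",
}

print(lowest_common_ancestor(["p1", "p2"], parent_of))  # article
```

A real extractor would still need to decide which paragraphs count as article text before computing the ancestor; otherwise unrelated blocks such as navigation or comments could pull the result up to `body`.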

@jecarr jecarr linked a pull request May 10, 2021 that will close this issue
@jecarr

jecarr commented May 10, 2021

Not sure if anyone is watching for updates on this issue but my linked PR has been tested with both URLs here. Happy to hear feedback/suggestions on it 👍🏽
