Not working on "nytimes.com" #90

Open
AndyTheFactory opened this issue Oct 24, 2023 · 12 comments

Comments

@AndyTheFactory
Owner

Issue by Praveena0989
Thu May 4 19:28:24 2017
Originally opened as codelucas/newspaper#363


I tried a few articles from NYTimes.com, but newspaper parses only the second half of each article; the first half is missing.
Example urls:
url 1
url2

@AndyTheFactory
Owner Author

Comment by dlundergreen
Wed May 24 15:14:17 2017


Did you check the website to make sure that you haven't reached the max free articles that you are allowed to see for the month?

@AndyTheFactory
Owner Author

Comment by Praveena0989
Wed May 24 17:59:38 2017


@dlundergreen I don't remember the last time I opened NYTimes, so I'm sure I haven't crossed the limit.

@AndyTheFactory
Owner Author

Comment by sskadamb
Wed Jul 5 19:22:17 2017


This also happens for other links. For example, on this URL only part of the body is parsed. Is this because the individual <p> elements are in different parent <div>s?

@AndyTheFactory
Owner Author

Comment by Cabu
Tue Sep 5 09:31:28 2017


NYTimes articles are split across 2 DIVs, and the second one is generally bigger, so newspaper picks it.

@AndyTheFactory
Owner Author

Comment by ghost
Mon Nov 5 10:01:07 2018


Was anyone able to solve this?

@AndyTheFactory
Owner Author

Comment by Cabu
Mon Nov 5 20:47:52 2018


I found that changing PARENT_DECAY to 1.0 makes it work for NYT.
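
To illustrate why a decay of 1.0 can help, here is a toy scorer. It is only a sketch, not newspaper's actual code: `Node`, `best_node`, the `parent_decay` parameter, and the page layout are all hypothetical. It assumes that each paragraph adds its score to its direct parent and a decayed share to its grandparent, and that the highest-scoring container is chosen as the article body.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    text_score: float = 0.0           # e.g. a stopword count for a paragraph
    parent: Optional["Node"] = None
    score: float = 0.0                # score accumulated from descendants


def best_node(paragraphs, containers, parent_decay):
    """Toy scorer: every paragraph adds its score to its parent and a
    decayed share to its grandparent; the top-scoring container wins."""
    for node in containers:
        node.score = 0.0
    for p in paragraphs:
        parent = p.parent
        if parent is None:
            continue
        parent.score += p.text_score
        if parent.parent is not None:
            parent.parent.score += p.text_score * parent_decay
    return max(containers, key=lambda n: n.score)


# Hypothetical page: one <article> holding two <div>s with 3 and 5 paragraphs.
article = Node("article")
first_half = Node("div#first", parent=article)
second_half = Node("div#second", parent=article)
paragraphs = [Node(f"p{i}", text_score=10, parent=first_half) for i in range(3)]
paragraphs += [Node(f"p{i}", text_score=10, parent=second_half) for i in range(3, 8)]
containers = [article, first_half, second_half]

print(best_node(paragraphs, containers, parent_decay=0.5).name)  # div#second
print(best_node(paragraphs, containers, parent_decay=1.0).name)  # article
```

With a decay of 0.5 the shared parent scores 40 and loses to the larger DIV (50), so only the second half is kept. With a decay of 1.0 the shared parent scores 80 and wins, so both halves are included.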

@AndyTheFactory
Owner Author

Comment by ghost
Tue Nov 6 09:55:46 2018


@Cabu I couldn't find a variable named PARENT_DECAY on the master branch. Where is it located?

@AndyTheFactory
Owner Author

Comment by Cabu
Tue Nov 6 10:05:04 2018


@loaighoraba
paper = newspaper.build(source_url, PARENT_DECAY=1.0)

@AndyTheFactory
Owner Author

Comment by ghost
Tue Nov 6 10:21:44 2018


@Cabu It seems this has changed on the master branch; there is no such variable.

@AndyTheFactory
Owner Author

Comment by Cabu
Tue Nov 6 13:21:46 2018


@loaighoraba
Oh yes, I see. It now seems to be hardcoded in extractor.py, line 825 :/
Having it as a 'hidden' feature was practical for sources like the NYT.

@AndyTheFactory
Owner Author

Comment by ghost
Tue Nov 6 14:30:39 2018


@Cabu I see. However, this won't solve the issue if the common parent is more than two levels up. Thanks for this anyway.
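
A small sketch of that limitation, under stated assumptions: it is not newspaper's code, and `ancestor_scores`, `max_levels`, and the node names are hypothetical. It assumes scores are propagated only to a fixed number of ancestors, so a common ancestor sitting above that limit never receives any score, whatever the decay.

```python
def ancestor_scores(parents, paragraph_scores, decay, max_levels):
    """Add each paragraph's score to up to `max_levels` ancestors,
    multiplying by `decay` for every level above the direct parent."""
    totals = {}
    for paragraph, score in paragraph_scores.items():
        node = parents.get(paragraph)
        level = 0
        while node is not None and level < max_levels:
            totals[node] = totals.get(node, 0.0) + score * decay ** level
            node = parents.get(node)
            level += 1
    return totals


# Hypothetical layout: each half sits in div > wrapper > article, so the
# common ancestor (article) is three levels above the paragraphs.
parents = {
    "div_a": "wrapper_a", "wrapper_a": "article",
    "div_b": "wrapper_b", "wrapper_b": "article",
}
paragraph_scores = {}
for i in range(3):
    parents[f"a{i}"] = "div_a"
    paragraph_scores[f"a{i}"] = 10.0
for i in range(5):
    parents[f"b{i}"] = "div_b"
    paragraph_scores[f"b{i}"] = 10.0

two_levels = ancestor_scores(parents, paragraph_scores, decay=1.0, max_levels=2)
three_levels = ancestor_scores(parents, paragraph_scores, decay=1.0, max_levels=3)

print(two_levels.get("article", 0.0))    # 0.0: article never receives a score
print(three_levels.get("article", 0.0))  # 80.0: article outranks both halves
```

In this toy layout, a decay of 1.0 with two levels of propagation still leaves the larger half (50) as the best candidate. Only propagating one level further lets the common ancestor win.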

@AndyTheFactory
Owner Author

Comment by jecarr
Mon May 10 05:57:46 2021


Not sure if anyone is watching for updates on this issue but my linked PR has been tested with both URLs here. Happy to hear feedback/suggestions on it 👍🏽
