Getting Older News Articles #973

Open
PaulKMandal opened this issue Sep 13, 2023 · 4 comments

Comments

@PaulKMandal

Hello, seven years ago this was posted: #245

I have a problem that requires me to scrape a large corpus of titles from 2013-2019 from various news sources. Ideally I would like to scrape 10 articles per date through this date range. The issue that I have is that newspaper only pulls the latest results. Does anyone have any insights on how to achieve this? Thanks!

@banagale

Paul, out of curiosity, can you share why you’re trying to use this package instead of scrapy?

@johnbumgarner

Paul, can you share an example of what you are trying to do?

@PaulKMandal
Author

I apologize for the delay.

> Paul, out of curiosity, can you share why you’re trying to use this package instead of scrapy?

Because NewsPaper has very robust functionality for scraping news articles.

> Paul, can you share an example of what you are trying to do?

Ideally, I want to be able to specify a date and scrape the news articles from a certain date. I wrote an implementation that pulls articles from Archive.org. My implementation is available here and it works as intended, but Archive.org can be slow and often times out.
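
For readers hitting the same timeouts: one common mitigation is to retry slow requests with exponential backoff. The sketch below is an illustration only, not the implementation linked above. The helper name `with_retries` and its parameters are hypothetical, and it assumes the request is wrapped in a zero-argument callable that raises `OSError` (which covers `TimeoutError` and `urllib.error.URLError`) on failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or `attempts` tries are used up.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    seconds between failed tries. `sleep` is injectable so the delay
    schedule can be checked without actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as exc:  # includes TimeoutError and URLError
            last_error = exc
            # Do not sleep after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

A call such as `with_retries(lambda: urlopen(url, timeout=30).read())` would then tolerate a few transient timeouts before giving up. Whether this helps in practice depends on how the archive throttles requests, so the delays would need tuning.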

@johnbumgarner

johnbumgarner commented Sep 24, 2023

So you want to search the wayback archives. I wrote an example on this in my overview document for NewsPaper. If you provide me some more details I will add another example on searching a resource (website) for random articles for a certain date(s).

FYI the wayback archives are always slow for scraping. There might be a way to gain some performance, but that would require testing.
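
As a rough illustration of searching the Wayback archives for a given date, the sketch below queries the Wayback Machine CDX server for captures of a site on one day and turns the result into replay URLs. It is not the example from the overview document mentioned above. The function names (`build_cdx_query`, `snapshot_urls`, `fetch_snapshots`) are hypothetical, and the sketch assumes the CDX server's JSON output, where the first row is a header that includes `timestamp` and `original` columns.

```python
import json
from datetime import date
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(site: str, day: date, limit: int = 10) -> str:
    """Build a CDX query URL for captures of `site` on a single day."""
    stamp = day.strftime("%Y%m%d")
    params = {
        "url": site,
        "from": stamp,
        "to": stamp,
        "output": "json",
        "limit": limit,
        "filter": "statuscode:200",  # skip redirects and error captures
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(cdx_rows: list) -> list:
    """Convert CDX JSON rows (header row first) into replay URLs."""
    if not cdx_rows:
        return []
    header, *rows = cdx_rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[orig]}" for row in rows]


def fetch_snapshots(site: str, day: date, limit: int = 10, timeout: float = 30.0) -> list:
    """Query the CDX server (network call) and return replay URLs."""
    with urlopen(build_cdx_query(site, day, limit), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    # An empty body means no captures matched the query.
    return snapshot_urls(json.loads(body) if body.strip() else [])
```

Each returned URL could then be passed to newspaper's `Article(url)`, followed by `download()` and `parse()`, to read the `title`. Two caveats apply: a capture of a homepage is a list of links rather than an article, so article URLs would still need to be collected from it, and extraction quality on archived pages may differ from live pages.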
