TIPS FOR IMPROVEMENT #978

Open
aleksandar-devedzic opened this issue Nov 16, 2023 · 5 comments

@aleksandar-devedzic

I have collected some meta tag names. You can try to identify the title, text, description and date by substituting each name below for {} in these selectors:

meta[property='{}']
meta[name='{}']
meta[itemprop='{}']

Meta tags for publication and modification date:

published_date
published_time
cXenseParse:publishtime
pubdate
publish_date
PublishDate
dcterms.created
rnews:datePublished
article:published_time
prism.publicationDate
displaydate
OriginalPublicationDate
og:published_time
datePublished
article_date_original
article.published
published_time_telegram
sailthru.date
date
Date
original-publish-date
DC.date.issued
dc.date
DC.Date
parsely-pub-date
publishtime
publication_date
uploadDate
coverageEndTime
publishdate
publish-date
publishedAtDate
dcterms.date
publishedDate
creationDateTime
pub_date
updated_time
og:updated_time
datemodified
last-modified
Last-Modified
DC.date.modified
article:modified_time
modified_time
modifiedDateTime
dc.dcterms.modified
lastmod

Meta tags for title:

dc.title
og:title
headline
articletitle
article-title
parsely-title
title

Meta tags for description:

description
og:description

Meta tags for body:

articleBody
articleText
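
The lookup described above can be sketched with the Python standard library alone. This is only an illustration, not the code used by newspaper3k or newspaper4k: `MetaCollector`, `first_meta`, `DATE_TAGS` and `TITLE_TAGS` are hypothetical names, the candidate lists hold just a few of the tags above, and their priority order is an assumption.

```python
from html.parser import HTMLParser

# Attributes matched by the three selectors above.
META_ATTRS = ("property", "name", "itemprop")

# A few tag names from the lists above; the priority order is an assumption.
DATE_TAGS = ["article:published_time", "datePublished", "pubdate", "date"]
TITLE_TAGS = ["og:title", "dc.title", "headline", "title"]


class MetaCollector(HTMLParser):
    """Collects the content of every <meta> tag, keyed by its identifying attribute value."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        content = attr_map.get("content")
        if not content:
            return
        for attr in META_ATTRS:
            key = attr_map.get(attr)
            if key:
                # Keep the first value seen for each key.
                self.found.setdefault(key, content)


def first_meta(html, candidates):
    """Return the content of the first candidate tag present in the page, or None."""
    collector = MetaCollector()
    collector.feed(html)
    for name in candidates:
        if name in collector.found:
            return collector.found[name]
    return None


sample = """
<html><head>
<meta property="og:title" content="Example headline">
<meta name="pubdate" content="2023-11-16T09:30:00Z">
<meta itemprop="description" content="Short summary">
</head></html>
"""

print(first_meta(sample, TITLE_TAGS))
print(first_meta(sample, DATE_TAGS))
```

Matching is exact and case-sensitive here, which is why the lists above carry variants such as date and Date separately.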

FYI
It would be good if you could fix, improve, or adapt the code so that it extracts full information from the websites below, since they are among the most popular news websites in the world.
By "full information" I mean the title, publication date, and article body.

CNN - https://edition.cnn.com/
BBC News - https://www.bbc.com/news
Reuters - https://www.reuters.com/
The New York Times - https://www.nytimes.com/
The Guardian - https://www.theguardian.com/international
Al Jazeera - https://www.aljazeera.com/
Associated Press (AP) News - https://apnews.com/
NBC News - https://www.nbcnews.com/
Fox News - https://www.foxnews.com/
USA Today - https://www.usatoday.com/
ABC News - https://abcnews.go.com/
CBS News - https://www.cbsnews.com/
The Washington Post - https://www.washingtonpost.com/
Time - https://time.com/
Forbes - https://www.forbes.com/
Bloomberg - https://www.bloomberg.com/
The Wall Street Journal - https://www.wsj.com/
The Huffington Post - https://www.huffpost.com/
The Independent - https://www.independent.co.uk/
The Sydney Morning Herald - https://www.smh.com.au/
The Economist - https://www.economist.com/
The Times of India - https://timesofindia.indiatimes.com/
The Daily Mail - https://www.dailymail.co.uk/home/index.html
The Telegraph - https://www.telegraph.co.uk/
The Sun - https://www.thesun.co.uk/
The Mirror - https://www.mirror.co.uk/
The Daily Beast - https://www.thedailybeast.com/
The Atlantic - https://www.theatlantic.com/
National Geographic - https://www.nationalgeographic.com/
Science Daily - https://www.sciencedaily.com/
The Verge - https://www.theverge.com/
Wired - https://www.wired.com/
TechCrunch - https://techcrunch.com/
Engadget - https://www.engadget.com/
Mashable - https://mashable.com/
Forbes India - https://www.forbesindia.com/
Hindustan Times - https://www.hindustantimes.com/
CNN Business - https://www.cnn.com/business
Financial Times - https://www.ft.com/
CNBC - https://www.cnbc.com/
Business Insider - https://www.businessinsider.com/
Politico - https://www.politico.eu/
The Hill - https://thehill.com/
The Washington Times - https://www.washingtontimes.com/
The Boston Globe - https://www.bostonglobe.com/
The LA Times - https://www.latimes.com/
The Chicago Tribune - https://www.chicagotribune.com/
The Globe and Mail - https://www.theglobeandmail.com/
The Toronto Star - https://www.thestar.com/

@AndyTheFactory

Hi @aleksandar-devedzic,
I forked newspaper3k, and your suggestions are implemented in the next version. The code is currently in the work-0.9.2 branch, so if you need it now, you can pull it from there. Alternatively, you can wait for the release ;)

Here is my fork:
https://github.com/AndyTheFactory/newspaper4k

@aleksandar-devedzic
Author

aleksandar-devedzic commented Nov 16, 2023 via email

@2dareis2do

2dareis2do commented Mar 24, 2024

I had an issue with BBC pages, where the p tags are nested in divs. Newspaper4k seems to work perfectly after installing typing-extensions, e.g.
pip install typing-extensions

Thanks

@aleksandar-devedzic
Author

One more improvement (about dates):
Sometimes the publication date appears in the URL,
so you can also check there as a last resort.
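
A minimal sketch of that URL fallback, assuming the date appears as a year, month and day separated by "/" or "-" (the `URL_DATE` pattern and the `date_from_url` helper are hypothetical, and the example.com URLs are placeholders):

```python
import re
from datetime import date

# Assumed pattern: a four-digit year (19xx or 20xx), then month and day,
# separated by "/" or "-", with no digits directly before or after.
URL_DATE = re.compile(r"(?<!\d)((?:19|20)\d{2})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def date_from_url(url):
    """Return a date found in the URL, or None if there is no valid one."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None


print(date_from_url("https://example.com/2023/11/16/some-story"))
print(date_from_url("https://example.com/news/some-story-68511760"))
```

Because a URL can contain date-like numbers that are unrelated to publication, this check fits best after the meta tags have been tried.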

@2dareis2do

2dareis2do commented Mar 27, 2024

On dates: why does the BBC article now prepend the date to the _text string?

e.g.

Published\n\n8 March\n\n

source https://www.bbc.co.uk/news/uk-england-london-68511760
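
Until the cause is known, a caller could strip such a prefix after extraction. This is only a workaround sketch: it assumes the prefix always has the shape shown above ("Published", a blank line, a day and month with an optional year, another blank line), and `strip_published_prefix` is a hypothetical helper, not part of newspaper4k.

```python
import re

# Assumed shape of the prefix: "Published", blank line, "8 March" (optionally
# followed by a year), blank line. Anchored at the start of the text only.
PUBLISHED_PREFIX = re.compile(r"^Published\n\n\d{1,2} [A-Za-z]+(?: \d{4})?\n\n")


def strip_published_prefix(text):
    """Remove a leading 'Published <date>' block if present; otherwise return text unchanged."""
    return PUBLISHED_PREFIX.sub("", text, count=1)


print(strip_published_prefix("Published\n\n8 March\n\nBody starts here."))
```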
