TIPS FOR IMPROVEMENT #978

aleksandar-devedzic · 2023-11-16T18:52:21Z

I have extracted some meta tags. You can try to identify the title, text, description and date by substituting the tag names below into these selectors:

meta[property='{}']
meta[name='{}']
meta[itemprop='{}']

Meta tags for publication and modification date:

published_date
published_time
cXenseParse:publishtime
pubdate
publish_date
PublishDate
dcterms.created
rnews:datePublished
article:published_time
prism.publicationDate
displaydate
OriginalPublicationDate
og:published_time
datePublished
article_date_original
article.published
published_time_telegram
sailthru.date
date
Date
original-publish-date
DC.date.issued
dc.date
DC.Date
parsely-pub-date
publishtime
publication_date
uploadDate
coverageEndTime
publishdate
publish-date
publishedAtDate
dcterms.date
publishedDate
creationDateTime
pub_date
updated_time
og:updated_time
datemodified
last-modified
Last-Modified
DC.date.modified
article:modified_time
modified_time
modifiedDateTime
dc.dcterms.modified
lastmod

Meta tags for title:

dc.title
og:title
headline
articletitle
article-title
parsely-title
title

Meta tags for description:

description
og:description

Meta tags for body:
articleBody
articleText
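
A minimal sketch of how the tag names above could be tried in order, using only Python's standard library instead of the CSS selectors themselves. The names `MetaCollector`, `first_meta`, `TITLE_TAGS` and `DATE_TAGS` are hypothetical, the candidate lists are only a small subset of the lists above, and this is not how newspaper3k itself does the extraction:

```python
from html.parser import HTMLParser

# Attribute names taken from the three selector templates above.
META_ATTRS = ("property", "name", "itemprop")

# Small subsets of the tag names listed above, in an assumed priority order.
TITLE_TAGS = ["og:title", "dc.title", "headline", "title"]
DATE_TAGS = ["article:published_time", "datePublished", "pubdate", "date"]


class MetaCollector(HTMLParser):
    """Collect <meta> content values keyed by property/name/itemprop."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        content = attrs.get("content")
        if not content:
            return
        for key in META_ATTRS:
            value = attrs.get(key)
            if value:
                # Keep the first occurrence of each tag name.
                self.found.setdefault(value, content.strip())


def first_meta(html, candidates):
    """Return the content of the first candidate tag present, else None."""
    collector = MetaCollector()
    collector.feed(html)
    for name in candidates:
        if name in collector.found:
            return collector.found[name]
    return None
```

Tag names are matched case-sensitively here, which is why the lists above contain both `date` and `Date`.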

FYI
It would be good if you could fix/improve/adapt the code so that it can extract full information from these websites, since they are among the most popular news websites in the world.
By "full information" I mean the title, publication date and article body.

CNN - https://edition.cnn.com/
BBC News - https://www.bbc.com/news
Reuters - https://www.reuters.com/
The New York Times - https://www.nytimes.com/
The Guardian - https://www.theguardian.com/international
Al Jazeera - https://www.aljazeera.com/
Associated Press (AP) News - https://apnews.com/
NBC News - https://www.nbcnews.com/
Fox News - https://www.foxnews.com/
USA Today - https://www.usatoday.com/
ABC News - https://abcnews.go.com/
CBS News - https://www.cbsnews.com/
The Washington Post - https://www.washingtonpost.com/
Time - https://time.com/
Forbes - https://www.forbes.com/
Bloomberg - https://www.bloomberg.com/
The Wall Street Journal - https://www.wsj.com/
The Huffington Post - https://www.huffpost.com/
The Independent - https://www.independent.co.uk/
The Sydney Morning Herald - https://www.smh.com.au/
The Economist - https://www.economist.com/
The Times of India - https://timesofindia.indiatimes.com/
The Daily Mail - https://www.dailymail.co.uk/home/index.html
The Telegraph - https://www.telegraph.co.uk/
The Sun - https://www.thesun.co.uk/
The Mirror - https://www.mirror.co.uk/
The Daily Beast - https://www.thedailybeast.com/
The Atlantic - https://www.theatlantic.com/
National Geographic - https://www.nationalgeographic.com/
Science Daily - https://www.sciencedaily.com/
The Verge - https://www.theverge.com/
Wired - https://www.wired.com/
TechCrunch - https://techcrunch.com/
Engadget - https://www.engadget.com/
Mashable - https://mashable.com/
Forbes India - https://www.forbesindia.com/
Hindustan Times - https://www.hindustantimes.com/
CNN Business - https://www.cnn.com/business
Financial Times - https://www.ft.com/
CNBC - https://www.cnbc.com/
Business Insider - https://www.businessinsider.com/
Politico - https://www.politico.eu/
The Hill - https://thehill.com/
The Washington Times - https://www.washingtontimes.com/
The Boston Globe - https://www.bostonglobe.com/
The LA Times - https://www.latimes.com/
The Chicago Tribune - https://www.chicagotribune.com/
The Globe and Mail - https://www.theglobeandmail.com/
The Toronto Star - https://www.thestar.com/

AndyTheFactory · 2023-11-16T21:26:13Z

Hi @aleksandar-devedzic,
I forked newspaper3k, and your suggestions are implemented in the next version. The code is currently in the work-0.9.2 branch, so if you need it now, you can pull it from there. Alternatively, you can wait for the release ;)

Here is my fork:
https://github.com/AndyTheFactory/newspaper4k

aleksandar-devedzic · 2023-11-16T21:29:08Z

Oh, thanks! I hope I helped you with this. All the best.


2dareis2do · 2024-03-24T21:57:35Z

I had an issue with BBC, where p tags are nested in divs. Newspaper4k seems to work perfectly after installing typing-extensions, e.g.:
pip install typing-extensions

Thanks

aleksandar-devedzic · 2024-03-24T23:33:10Z

One more improvement (about dates): sometimes the publication date appears in the URL, so you can also check that as a last resort.
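
A rough sketch of that URL fallback, assuming the date appears as a `YYYY/MM/DD` or `YYYY-MM-DD` style segment. The pattern and the `date_from_url` name are hypothetical, and real sites use other layouts that this will miss:

```python
import re
from datetime import date

# Assumed pattern: a 19xx/20xx year, then month and day, separated by
# "/", "-" or "_", and not embedded in a longer run of digits.
URL_DATE_RE = re.compile(
    r"(?<!\d)((?:19|20)\d{2})[/_-](\d{1,2})[/_-](\d{1,2})(?!\d)"
)


def date_from_url(url):
    """Return a date parsed from the URL, or None if none is found."""
    match = URL_DATE_RE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the shape of a date but are not a valid one.
        return None
```

Since numeric article IDs can look like dates, it makes sense to use this only after the meta tags have produced nothing.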

2dareis2do · 2024-03-27T13:48:01Z

On dates: why does the BBC article now prepend the date to the _text string?

e.g.

Published\n\n8 March\n\n

source https://www.bbc.co.uk/news/uk-england-london-68511760
