getArticleUris sometimes null sometimes works (based on order / amount of urls) #63

Open
Palmik opened this issue Oct 3, 2023 · 10 comments

Comments

@Palmik

Palmik commented Oct 3, 2023

Example (this happens with both the Python and REST APIs, as the Python one just calls the REST API directly):

Multiple URLs (the dailymail URL gets null, but only when it is listed second; it works when it is listed first!):

curl --request POST \
     --url "http://eventregistry.org/api/v1/articleMapper" \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
{
    "articleUrl": [
        "https://www.business-standard.com/article/pti-stories/japan-eyes-record-defence-budget-amid-n-korea-china-threats-118083100326_1.html",
        "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html"
    ],
    "includeAllVersions": true,
    "deep": true,
    "apiKey": "XXX"
}
'

Response:

{
    "https://www.business-standard.com/article/pti-stories/japan-eyes-record-defence-budget-amid-n-korea-china-threats-118083100326_1.html": "936069503",
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html": null
}

Single URL (the dailymail will be mapped):

curl --request POST \
     --url "http://eventregistry.org/api/v1/articleMapper" \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
{
    "articleUrl": [
        "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html"
    ],
    "includeAllVersions": true,
    "deep": true,
    "apiKey": "XXX"
}
'

Response:

{
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html": "7763074647"
}
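
For reference, the same request body can be built from Python. This is a minimal sketch using only the standard library, not the official Python client; the endpoint URL and the field names are copied from the curl calls above, while the helper name `build_article_mapper_request` is made up for illustration.

```python
import json
import urllib.request

# Endpoint exactly as used in the curl examples above.
ARTICLE_MAPPER_URL = "http://eventregistry.org/api/v1/articleMapper"


def build_article_mapper_request(urls, api_key):
    """Build (but do not send) the POST request shown in the curl examples.

    Hypothetical helper: the name and signature are illustrative only.
    """
    payload = {
        "articleUrl": list(urls),
        "includeAllVersions": True,
        "deep": True,
        "apiKey": api_key,
    }
    return urllib.request.Request(
        ARTICLE_MAPPER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "content-type": "application/json",
        },
        method="POST",
    )
```

To actually send it, pass the returned object to `urllib.request.urlopen` and decode the body with `json.load`. Building the request separately makes it easy to replay the same URL list in a different order and compare the two responses.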
@Palmik
Author

Palmik commented Oct 3, 2023

Another interesting example:

{

  "articleUrl": [
  "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490",
  "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html"
  ],
  "includeAllVersions": true,
  "deep": true,
  "apiKey": "XXX"
}

Response:

{
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490": "7763040460",
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html": "7763074647"
}

VS

{

  "articleUrl": [
  "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html",
  "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490"
  ],
  "includeAllVersions": true,
  "deep": true,
  "apiKey": "XXX"
}

Response:

{
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490": "7763040460"
}

@gregorleban
Collaborator

There doesn't seem to be an error related to this API call.

The article that we have in our DB is "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ns_campaign=1490&ito=1490". When mapping to the URI we also create alternative versions of the urls that we test. One version is without the parameters. Another version is without the "www." prefix.

The URI that you receive is the URI of the article that we have in our database.
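
The "alternative versions" described above could look roughly like the sketch below. It is only an illustration of the two variants mentioned (no query parameters, no "www." prefix), not the service's actual code; the function name `candidate_urls` and the order in which variants are tried are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(url):
    """Return the URL plus variants without the query and without 'www.'.

    Hypothetical sketch: the real mapper may try other variants or a
    different order. Any '#fragment' is dropped here for simplicity.
    """
    parts = urlsplit(url)

    hosts = [parts.netloc]
    if parts.netloc.startswith("www."):
        hosts.append(parts.netloc[len("www."):])

    queries = [parts.query]
    if parts.query:
        queries.append("")  # variant with the parameters removed

    candidates = []
    for host in hosts:
        for query in queries:
            candidate = urlunsplit((parts.scheme, host, parts.path, query, ""))
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates
```

Note that every variant is derived by removing something from the input URL; nothing here can add parameters that the caller did not supply.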

Regarding the first reported issue (i.e. not returning uri when providing multiple urls):

In your case, it seems that you first made the call with a single url (https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html) and later repeated the query with multiple urls.
The article with this url was found to be a duplicate of the article https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ns_campaign=1490&ito=1490, which we had already found and imported earlier. Therefore the article with url https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html (and uri 7763074647) was removed before you made the second query with multiple urls.

I hope this explains the confusion.

@Palmik
Author

Palmik commented Oct 11, 2023

Hi Greg, thanks for the answer.

Unfortunately all of the example URLs from my original message now return null (this seems like a separate issue), so it's hard to verify. But I seem to recall being able to reproduce this behaviour with the same URLs.

What I would like to achieve is:

  • Take a list of my own URLs (these might or might not have extra query params in them)
  • Map each of them to its current equivalent newsapi URI. If you have determined that https://example.com/foo is the same as https://example.com/foo?bar=1, then both of these URLs should return some (the same?) URI.

As you can see, getArticleUris does not seem to be robust to query parameter variations. In the last example call, I only got a URI for "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490", and not for the duplicate (whereas in the previous call, I got a URI for both).

Since you identified "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html" to be duplicate of "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490", why not also return the URI for it?

As it stands, I am not sure how to use the API to reliably get back URIs.

@gregorleban
Collaborator

{
    "articleUrl": [
        "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ns_campaign=1490&ito=1490"
    ],
    "includeAllVersions": true,
    "deep": true,
    "apiKey": "{{ _.prodApiKey }}"
}

This call does not return null, since that is the url that we have in the DB.

What you would like to achieve is generally exactly what the article mapper is for. The only issue is that if you provide a url that we don't have in the DB, then we cannot return a URI for it.

If you provide a url that is not exactly the url that we have in the DB, then in some cases we can resolve it and in some cases we cannot.
If you have:
a.com/b/c?x=123
when we actually store url:
a.com/b/c
then you will receive a valid URI from us since we also try resolving to urls without the params.

If, on the other hand, we store url
a.com/b/c?x=123
and you provide us url:
a.com/b/c
then we cannot provide you a valid URI since we don't do approximate searches in our DB and we cannot guess the extra params to your url that would then match our url.
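
The asymmetry can be reproduced with a toy exact-match lookup. This is a sketch under the assumption that the store behaves like a dictionary keyed by the exact stored url; the names `strip_query` and `map_url` and the URI values "1" and "2" are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the url without its query parameters (and without a fragment)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def map_url(url, stored):
    """Look up the url exactly, then retry with the parameters removed.

    `stored` is a hypothetical dict of {stored url: article URI}.
    There is no approximate search: parameters are only ever removed.
    """
    for candidate in (url, strip_query(url)):
        if candidate in stored:
            return stored[candidate]
    return None


# Stored without params, queried with params: resolves.
print(map_url("https://a.com/b/c?x=123", {"https://a.com/b/c": "1"}))

# Stored with params, queried without params: no match.
print(map_url("https://a.com/b/c", {"https://a.com/b/c?x=123": "2"}))
```

The second lookup fails because the only way to reach the stored key would be to guess `x=123`, which an exact-match lookup cannot do.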

We cannot return you the URI for https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html as this article was removed from the DB and we have no record of it anymore.

My suggestion is that you first call the API (the article mapper) for your articles. You then keep the articles for which you get a valid URI, and for the remaining ones you call the https://newsapi.ai/documentation?tab=extractArticleInfo endpoint.

Do you have a particular reason why you need specifically the articleMapper?
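
The two-step approach suggested above starts by splitting the mapper response into resolved and unresolved urls. A minimal sketch, assuming the response is the JSON object shown earlier in this thread (url -> URI, with null or a missing key for unmapped urls); the function name `split_mapped` is hypothetical.

```python
def split_mapped(requested_urls, mapper_response):
    """Separate urls that received a URI from those that did not.

    A url counts as unresolved if it maps to None (JSON null) or is
    absent from the response altogether; both cases appear in this thread.
    """
    resolved = {}
    unresolved = []
    for url in requested_urls:
        uri = mapper_response.get(url)
        if uri is None:
            unresolved.append(url)
        else:
            resolved[url] = uri
    return resolved, unresolved
```

The `unresolved` list is what would then be sent to the extractArticleInfo endpoint, so the more expensive call is only made for urls the mapper could not handle.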

@Palmik
Author

Palmik commented Oct 11, 2023

Yes, the reason for using articleMapper is that I source URLs from various places, and not just from newsapi.ai. Therefore the URLs might come with various extra query params attached that might not match the ones that newsapi.ai is storing.

Out of curiosity, why would the article get deleted?

@gregorleban
Collaborator

> Yes, the reason for using articleMapper is that I source URLs from various places, and not just from newsapi.ai. Therefore the URLs might come with various extra query params attached that might not match the ones that newsapi.ai is storing.

Ok. Does the endpoint that I suggested for you (https://newsapi.ai/documentation?tab=extractArticleInfo) therefore work for your purposes?

> Out of curiosity, why would the article get deleted?

The articles that get deleted are duplicated articles that come from the same source. So if we see that we imported the same article with a different url multiple times, we remove such duplicates since they bring no value to any user.

@Palmik
Author

Palmik commented Oct 12, 2023

Yes, that endpoint returns the article content even for the URLs where ArticleMapper returns null. (I still don't understand why that is: why couldn't ArticleMapper use the same URL -> URI resolution logic?) However, it's ~9 times more expensive than ArticleMapper + GetArticle: getting 100 articles from given URLs takes 100 tokens with ExtractArticleInfo but only 11 tokens with ArticleMapper + GetArticle. So it won't be feasible for our use case.

@gregorleban
Collaborator

gregorleban commented Oct 12, 2023 via email

@Palmik
Author

Palmik commented Oct 12, 2023

I see. That's great to know about the ExtractArticleInfo token usage; it seems even better and easier than ArticleMapper + GetArticle. Based on this, I consider my issue resolved.

But I have to say the API is quite unintuitive in this regard. I see no reason why e.g. "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html" would return null with ArticleMapper, yet ExtractArticleInfo has no problem finding the URL. So given that ExtractArticleInfo has the information, your system knows about the URL.

@gregorleban
Collaborator

gregorleban commented Oct 12, 2023 via email
