Skip unparsable urls #505

Closed
AndyTheFactory opened this issue Oct 24, 2023 · 0 comments
Labels
bug (Something isn't working), PR-verify (Has a PR, must be checked)

Comments

@AndyTheFactory
Owner

Issue by frankier
Thu Feb 11 09:57:42 2021
Originally opened as codelucas/newspaper#872


I get problems with some image URLs when using news-please:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/newsplease/crawler/commoncrawl_extractor.py", line 259, in __process_warc_gz_file
    filter_pass, article = self.filter_record(record)
  File "/opt/coviddash/ingress/covidmarch.py", line 24, in filter_record
    passed_filters, article = super().filter_record(warc_record, article)
  File "/usr/local/lib/python3.7/dist-packages/newsplease/crawler/commoncrawl_extractor.py", line 123, in filter_record
    article = self._from_warc(warc_record)
  File "/usr/local/lib/python3.7/dist-packages/newsplease/crawler/commoncrawl_extractor.py", line 235, in _from_warc
    return NewsPlease.from_warc(record, decode_errors="replace" if self.__ignore_unicode_errors else "strict", fetch_images=self.__fetch_images)
  File "/usr/local/lib/python3.7/dist-packages/newsplease/__init__.py", line 55, in from_warc
    article = NewsPlease.from_html(html, url=url, download_date=download_date, fetch_images=fetch_images)
  File "/usr/local/lib/python3.7/dist-packages/newsplease/__init__.py", line 95, in from_html
    item = extractor.extract(item)
  File "/usr/local/lib/python3.7/dist-packages/newsplease/pipeline/extractor/article_extractor.py", line 63, in extract
    article_candidate = extractor.extract(item)
  File "/usr/local/lib/python3.7/dist-packages/newsplease/pipeline/extractor/extractors/newspaper_extractor.py", line 33, in extract
    article.parse()
  File "/usr/local/lib/python3.7/dist-packages/newspaper/article.py", line 261, in parse
    self.fetch_images()
  File "/usr/local/lib/python3.7/dist-packages/newspaper/article.py", line 272, in fetch_images
    imgs = self.extractor.get_img_urls(self.url, self.clean_doc)
  File "/usr/local/lib/python3.7/dist-packages/newspaper/extractors.py", line 570, in get_img_urls
    for url in urls])
  File "/usr/local/lib/python3.7/dist-packages/newspaper/extractors.py", line 570, in <listcomp>
    for url in urls])
  File "/usr/lib/python3.7/urllib/parse.py", line 511, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "/usr/lib/python3.7/urllib/parse.py", line 368, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python3.7/urllib/parse.py", line 435, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL
ERROR:newsplease.crawler.commoncrawl_extractor:Unexpected error: <class 'ValueError'> (Invalid IPv6 URL)

frankier included the following code: https://github.com/codelucas/newspaper/pull/872/commits
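
For illustration, the idea in the title (skip image URLs that `urljoin` cannot parse instead of letting the `ValueError` abort the whole article) could look roughly like the sketch below. This is not the code from the linked PR; `join_img_urls` is a hypothetical helper name, and it only assumes that the failure comes from `urllib.parse.urljoin` raising `ValueError`, as the traceback shows.

```python
from urllib.parse import urljoin


def join_img_urls(base_url, urls):
    """Join each candidate URL onto base_url, skipping unparsable ones.

    Hypothetical helper for illustration; not part of newspaper's API.
    """
    joined = []
    for url in urls:
        try:
            joined.append(urljoin(base_url, url))
        except ValueError:
            # urlsplit raises ValueError ("Invalid IPv6 URL") for hosts
            # with unbalanced square brackets; skip such URLs.
            continue
    return joined


if __name__ == "__main__":
    candidates = ["img/x.png", "http://[bad/img.png", "/y.jpg"]
    print(join_img_urls("http://example.com/a/", candidates))
```

With this approach a single malformed `src` attribute drops only that image from the result, while the rest of the article still parses.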

@AndyTheFactory added the bug (Something isn't working) and PR-verify (Has a PR, must be checked) labels Oct 30, 2023
@AndyTheFactory added this to the Release 0.9.1 milestone Oct 30, 2023