
tolerance for unknown/ bad charsets #320

Open
cybergaukler opened this issue Aug 1, 2019 · 1 comment

@cybergaukler

This is a fringe case, since the crawler works wonderfully 99.9% of the time.

While checking some results that had failed, I came across a (German) site that initially returned no content:
http://www.lorei-baustoffe.de/

The problem was that it returned the following in the header:
Content-Type: text/html; charset=none

This in turn threw an error, since iconv does not recognize "none" as an encoding.

I applied a rough patch in my code that changes "none" to "utf-8", and the site was then fetched correctly.

I am not sure whether this would be a desired feature for the crawler as well.

  • pro: you would not need to re-crawl on charset errors
  • con: you would not be able to see that such an error exists
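
The rough patch described above could be sketched roughly as follows. This is only an illustration, not the crawler's own code: `resolveCharset` is a hypothetical helper, and it uses Node's built-in `TextDecoder` to test whether a label is known, which is an assumption (the set of labels iconv accepts may differ).

```javascript
// Hypothetical helper (not part of the crawler): pick a usable charset
// from a Content-Type header, falling back when the label is missing
// or is not a recognised encoding such as "none".
function resolveCharset(contentType, fallback = "utf-8") {
  const match = /charset\s*=\s*"?([^\s;"]+)"?/i.exec(contentType || "");
  if (!match) return fallback;
  const label = match[1].toLowerCase();
  try {
    // TextDecoder throws for labels it does not recognise.
    new TextDecoder(label);
    return label;
  } catch (err) {
    return fallback;
  }
}

console.log(resolveCharset("text/html; charset=none"));
console.log(resolveCharset("text/html; charset=UTF-8"));
```

Normalizing the label before decoding avoids the re-crawl mentioned in the "pro" point, at the cost named in the "con" point: the bad header is silently replaced unless the caller also logs it.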
@cybergaukler changed the title from "charset tolerance for" to "tolerance for unknown/ bad charsets" on Aug 1, 2019
@mike442144
Collaborator

Set encoding to null to receive a Buffer and handle the decoding yourself.
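
A minimal sketch of what "handling it yourself" might look like once the raw Buffer is in hand. `decodeBody` is a hypothetical helper, the sample body is made up for illustration, and how the Buffer and the declared charset reach this function from the crawler is left out because the thread does not show it.

```javascript
// Hypothetical helper: decode a raw response body and report whether the
// declared charset had to be replaced, so the bad header stays visible.
function decodeBody(body, declaredCharset, fallback = "utf-8") {
  let charset = (declaredCharset || fallback).toLowerCase();
  let usedFallback = false;
  let decoder;
  try {
    decoder = new TextDecoder(charset);
  } catch (err) {
    // The declared label (e.g. "none") is not a known encoding.
    decoder = new TextDecoder(fallback);
    charset = fallback;
    usedFallback = true;
  }
  return { text: decoder.decode(body), charset, usedFallback };
}

// Made-up sample body standing in for the Buffer received from the crawler.
const raw = Buffer.from("<title>Baustoffe für Profis</title>", "utf8");
const result = decodeBody(raw, "none");
if (result.usedFallback) {
  console.warn("declared charset rejected, decoded as " + result.charset);
}
console.log(result.text);
```

Returning the `usedFallback` flag alongside the text keeps the charset error observable, which addresses the "con" raised in the original report.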
