HttpCrawler - determining character encoding #2317

Open
1 task done
barjin opened this issue Feb 2, 2024 · 0 comments
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

barjin commented Feb 2, 2024

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Issue description

The HTML living standard defines steps for determining the HTML document's character encoding.

The HttpCrawler (and, transitively, CheerioCrawler) determines the encoding only from the charset parameter of the HTTP Content-Type header, optionally falling back to the suggestResponseEncoding option. This breaks, most notably, the parsing of websites that declare their encoding in a <meta http-equiv="Content-Type"> element. The HTML standard handles this case with the byte stream prescan.
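To illustrate the idea, here is a minimal sketch of a heavily simplified prescan. It is not the full algorithm from the HTML standard and not Crawlee's implementation: it only searches the first 1024 bytes (the amount the standard encourages user agents to prescan) for a charset declaration inside a <meta> tag. The function names sniffMetaCharset and decodeBody are hypothetical, and the regex ignores comments, scripts, and the other edge cases the standard covers.

```javascript
// Hypothetical helper: a rough approximation of the byte stream prescan.
// Returns the lowercased charset label found in a <meta> tag, or null.
function sniffMetaCharset(bytes) {
    // Read the head of the document as latin1 so every byte maps to one
    // character; the markup we are looking for is ASCII-only anyway.
    const head = Buffer.from(bytes.subarray(0, 1024)).toString('latin1');
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    const match = /<meta[^>]*charset\s*=\s*["']?([^\s"';>\/]+)/i.exec(head);
    return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: the charset from the Content-Type header wins,
// then the <meta> declaration, then UTF-8 as the last resort.
function decodeBody(bytes, headerCharset = null) {
    const label = headerCharset ?? sniffMetaCharset(bytes) ?? 'utf-8';
    try {
        return new TextDecoder(label).decode(bytes);
    } catch {
        // TextDecoder throws a RangeError for labels it does not know.
        return new TextDecoder('utf-8').decode(bytes);
    }
}
```

With a response body available as a Buffer, calling decodeBody(buffer, charsetFromHeader) would pick up a declaration such as charset=gb2312 when the header carries no charset. Whether a given label can actually be decoded depends on the encodings the runtime's TextDecoder supports.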

Previously reported in #524 and this WCC issue.

Code sample

import { CheerioCrawler } from '@crawlee/cheerio';

const crawler = new CheerioCrawler({
    requestHandler: async ({ body }) => {
        // The body is decoded without consulting the page's <meta> charset declaration.
        console.log(body);
    },
});

(async () => {
    await crawler.run([
        'http://finance.ce.cn/stock/gsgdbd/202207/01/t20220701_37824007.shtml',
        // other webpages with this issue:
        // 'https://www.imot.bg/pcgi/imot.cgi?act=5&adv=2b157484078874523&slink=51kk4i&f1=1'
        // 'http://www.karlin.mff.cuni.cz/~antoch/'
    ]);
})();

Package version

3.7.3

Node.js version

Node.js 16, 18, 20

Operating system

OS agnostic

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

barjin added the bug label on Feb 2, 2024
B4nan added the t-tooling label on Feb 12, 2024