Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/http (HttpCrawler)
Issue description
The HTML living standard defines steps for determining an HTML document's character encoding.
The HttpCrawler (and transitively, CheerioCrawler) only uses the charset parameter of the HTTP Content-Type header to determine the encoding, optionally falling back to the suggestResponseEncoding option. This breaks, most notably, the parsing of websites that declare their encoding in a <meta http-equiv="Content-Type" ...> element instead. The HTML standard solves this with the byte stream prescan.
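To illustrate what the prescan is meant to catch, here is a minimal sketch of sniffing the declared charset from the raw response bytes. It is a simplified stand-in, not the full algorithm from the HTML standard and not Crawlee's implementation: `sniffMetaCharset` is a hypothetical helper name, and the regular expressions only cover the common `<meta charset=...>` and `<meta http-equiv="Content-Type" content="...; charset=...">` forms. It assumes the declaration sits within the first 1024 bytes, which is where the standard requires it to be.

```javascript
// Hypothetical helper (not part of Crawlee): a simplified approximation of
// the HTML standard's byte stream prescan. Returns the declared charset
// label in lowercase, or undefined if none is found.
function sniffMetaCharset(bytes) {
    // Inspect only the first 1024 bytes. Decoding as latin1 maps every byte
    // to exactly one character, so the ASCII markup survives regardless of
    // the document's real encoding.
    const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');

    // Look at each <meta ...> tag in turn.
    for (const [tag] of head.matchAll(/<meta\b[^>]*>/gi)) {
        // Matches both charset="..." as an attribute and "charset=..."
        // inside a content="text/html; charset=..." attribute value.
        const match = /\bcharset\s*=\s*["']?([^\s"'\/>;]+)/i.exec(tag);
        if (match) return match[1].toLowerCase();
    }
    return undefined;
}

// Usage: a page that declares its encoding only in the markup.
const sample = Buffer.from(
    '<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"></head>',
    'latin1',
);
console.log(sniffMetaCharset(sample)); // prints "gb2312"
```

The returned label could then be handed to a decoder such as `TextDecoder`. Whether a given label is supported depends on the runtime's ICU build, so treat that step as an assumption to verify.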
Code sample
import { CheerioCrawler } from "@crawlee/cheerio";

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, body, response, request }) => {
        console.log(body);
    },
});

(async () => {
    await crawler.run([
        'http://finance.ce.cn/stock/gsgdbd/202207/01/t20220701_37824007.shtml',
        // other webpages with this issue:
        // 'https://www.imot.bg/pcgi/imot.cgi?act=5&adv=2b157484078874523&slink=51kk4i&f1=1',
        // 'http://www.karlin.mff.cuni.cz/~antoch/',
    ]);
})();
Package version
3.7.3
Node.js version
Node.js 16, 18, 20
Operating system
OS agnostic
Apify platform
Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
Previously reported in #524 and this WCC issue.