XML encoding is not taken into account #26

DavidNemeskey · 2019-01-14T11:44:06Z

While jusText correctly extracts the page encoding of an HTML page from the meta tag, it does not do so for XHTML, which declares the encoding in an XML header:

<?xml version="1.0" encoding="iso-8859-2"?>
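For illustration, here is a minimal sketch of how the encoding could be read from such a header before decoding the page. It is not jusText's actual code: the function name `xml_declared_encoding` and the 1024-byte search window are assumptions made for this example, and the pattern is deliberately lenient about leading whitespace.

```python
import re
from typing import Optional

# Matches the encoding pseudo-attribute of an XML declaration at the start of the input.
XML_DECLARATION_ENCODING = re.compile(
    rb"""^\s*<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_declared_encoding(raw: bytes) -> Optional[str]:
    """Return the encoding named in a leading XML declaration, or None if absent.

    Hypothetical helper: only the first 1024 bytes are inspected, which is an
    arbitrary limit chosen for this sketch.
    """
    match = XML_DECLARATION_ENCODING.match(raw[:1024])
    if match is None:
        return None
    # Encoding names are ASCII; lowercase them for easier comparison.
    return match.group(1).decode("ascii").lower()


page = b'<?xml version="1.0" encoding="iso-8859-2"?>\n<html><body>\xe1rv\xedz</body></html>'
encoding = xml_declared_encoding(page)
# Fall back to UTF-8 when no XML declaration is present (an assumption of this sketch).
text = page.decode(encoding or "utf-8", errors="replace")
```

A check like this would presumably run alongside the existing meta-based detection, so that plain HTML pages keep behaving as before.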

miso-belica · 2021-10-14T15:42:15Z

You are right. Do you know any real website using XML serialization of the HTML?

DavidNemeskey · 2021-10-14T21:37:34Z

Unfortunately not, it was quite some time ago... I encountered this issue while processing Common Crawl data, but I do understand that having to download & parse a billion pages to find one that uses XHTML is a bit too much to ask 😄

miso-belica · 2021-10-15T08:46:45Z

OK, thank you. I guess it's not that hard to add. Maybe I am wrong but I don't think there are plenty of XHTML documents left out there. We will see if anyone else writes here.

Have a nice day 🙂

miso-belica self-assigned this Oct 14, 2021
