XML encoding is not taken into account #26

Open
DavidNemeskey opened this issue Jan 14, 2019 · 3 comments
@DavidNemeskey
While jusText correctly extracts the page encoding of an HTML page from its meta tag, it does not do so for XHTML, which declares the encoding in an XML header:

<?xml version="1.0" encoding="iso-8859-2"?>
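
A minimal sketch of how such a declaration could be detected before decoding. The names `xml_declared_encoding` and `decode_xhtml` are hypothetical and not part of jusText's API; the sketch also assumes the document is in an ASCII-compatible encoding and does not start with a byte order mark:

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the
# start of the document, e.g. <?xml version="1.0" encoding="iso-8859-2"?>
_XML_DECL_ENCODING = re.compile(
    rb"""^\s*<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._\-]*)["']"""
)


def xml_declared_encoding(raw, default=None):
    """Return the encoding named in the XML declaration, or `default`."""
    # The declaration can only appear at the top, so a short prefix is enough.
    match = _XML_DECL_ENCODING.match(raw[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii")


def decode_xhtml(raw, fallback="utf-8"):
    """Decode bytes using the declared encoding, falling back if it is unknown."""
    encoding = xml_declared_encoding(raw, default=fallback)
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The declared name is not a codec Python knows about.
        return raw.decode(fallback, errors="replace")


sample = (
    '<?xml version="1.0" encoding="iso-8859-2"?>'
    "<html><body>árvíztűrő</body></html>"
).encode("iso-8859-2")
text = decode_xhtml(sample)
```

A check like this would presumably run before, or alongside, the existing meta-tag lookup, so that plain HTML pages keep their current behaviour.
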
@miso-belica
Owner

You are right. Do you know any real website using XML serialization of the HTML?

@miso-belica miso-belica self-assigned this Oct 14, 2021
@DavidNemeskey
Author

Unfortunately not, it was quite some time ago... I encountered this issue while processing Common Crawl data, but I do understand that having to download & parse a billion pages to find one that uses XHTML is a bit too much to ask 😄

@miso-belica
Owner

OK, thank you. I guess it's not that hard to add. Maybe I am wrong, but I don't think there are many XHTML documents left out there. We will see if anyone else writes here.

Have a nice day 🙂
