Byte order marker (BOM) is displayed as empty cell #1095

Open
njhanley opened this issue Apr 30, 2023 · 9 comments
Comments

@njhanley
Contributor

The byte order mark (BOM) is a zero width no-break space character (U+FEFF) placed at the start of a file to indicate the byte order of UTF-16/32 encodings. While it serves no byte-order purpose in UTF-8, it is legal there and occasionally used as a signature to indicate UTF-8 encoding.

Consider this file: bom.txt
When opened in vis, the BOM is visible as a blank cell when it should be invisible. Interestingly, ZWNBSP is correctly displayed (or rather not displayed) when it appears anywhere else in the file.

https://unicode.org/faq/utf_bom.html#BOM
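
For reference, here is a minimal Python sketch of what U+FEFF looks like as bytes in each encoding form. The variable names are illustrative only and nothing here comes from vis itself.

```python
import codecs

ZWNBSP = "\ufeff"  # zero width no-break space, the code point used as the BOM

# Encode the single code point in each encoding form.
utf8_bom = ZWNBSP.encode("utf-8")
utf16_be_bom = ZWNBSP.encode("utf-16-be")
utf16_le_bom = ZWNBSP.encode("utf-16-le")

# The standard library exposes the same byte sequences as constants.
assert utf8_bom == codecs.BOM_UTF8
assert utf16_be_bom == codecs.BOM_UTF16_BE
assert utf16_le_bom == codecs.BOM_UTF16_LE

print(utf8_bom.hex(" "))      # ef bb bf
print(utf16_be_bom.hex(" "))  # fe ff
print(utf16_le_bom.hex(" "))  # ff fe
```

In UTF-16 the two possible byte sequences distinguish big-endian from little-endian data; in UTF-8 there is only one possible sequence, which is why it can act only as a signature.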

@mcepl
Contributor

mcepl commented May 6, 2023

With reference to https://github.com/martanne/vis/wiki/FAQ#how-should-i-edit-files-in-legacy-encodings I would suggest WONTFIX here. vis (in contrast to vim) doesn’t go into the business of dealing with encodings (or CRLF vs. LF); it is just a plain text editor. If anybody wants to get rid of a BOM, there are ways to do it. Also, if you are dealing with text files originating from that platform, you may well know that dos2unix removes the BOM as well.

Yes, a BOM in UTF-8 is an abomination of lesser platforms (so-called “operating systems”), which punish everybody else for their unfortunate decision to use a double-byte encoding for text. UTF-8 doesn’t need a BOM, but that whole business should be kept outside of vis in my opinion.
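
As one example of handling this outside the editor, here is a small Python sketch that drops a leading UTF-8 BOM from a byte string. The function name `strip_utf8_bom` is hypothetical, and this is not a description of how dos2unix does it internally.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; all other bytes are untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


sample = codecs.BOM_UTF8 + b"Hello\n"
print(strip_utf8_bom(sample))  # b'Hello\n'

# Python's "utf-8-sig" codec also drops one leading BOM while decoding.
decoded = sample.decode("utf-8-sig")
```

Only a BOM at the very start is removed; a U+FEFF later in the data is left alone, since there it is an ordinary zero width no-break space.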

@njhanley
Contributor Author

njhanley commented May 6, 2023

The issue isn't that vis should interpret or remove BOMs; it's that a ZWNBSP at the start of a file (a BOM) is currently rendered differently from a ZWNBSP elsewhere in the file. See zwnbsp.txt. The ZWNBSP between 'H' and 'e' is correctly rendered as invisible.

@mcepl
Contributor

mcepl commented May 6, 2023

Cannot reproduce here. With vis v0.8-git +curses +lua +tre +acl +selinux I get:

[screenshot: screenshot-2023-05-06_22-05-1683406371]

@rnpnr
Collaborator

rnpnr commented May 6, 2023

That was the point. If you open bom.txt, vis consumes the cursor and the window renders incorrectly. In zwnbsp.txt the same bytes are present between 'H' and 'e', but vis correctly renders them as invisible and it doesn't affect the rest of the UI. You will have to use something like od to see the bytes, e.g.: od -t x1 bom.txt

I have noticed this problem before but usually I just press x and delete the character if the file has it at the start because I really don't care about the file being compatible with where it came from.
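
If od is not at hand, the same byte-level view can be sketched in Python. The sample contents below are an assumption ("Hello" plus a newline); the actual contents of the attached bom.txt and zwnbsp.txt are not shown in this thread, and `hex_bytes` is a hypothetical helper.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hex pairs, similar to od -t x1 output."""
    return data.hex(" ")


# Assumed stand-ins for the two attachments: U+FEFF at the start vs. after 'H'.
bom_start = "\ufeffHello\n".encode("utf-8")
bom_middle = "H\ufeffello\n".encode("utf-8")

print(hex_bytes(bom_start))   # ef bb bf 48 65 6c 6c 6f 0a
print(hex_bytes(bom_middle))  # 48 ef bb bf 65 6c 6c 6f 0a
```

For a real file, reading it in binary mode and passing the result to `hex_bytes` gives the same kind of view.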

@njhanley
Contributor Author

njhanley commented May 6, 2023

The same behavior can be seen with other zero width characters such as zero-width space (ZWSP) and word joiner (WJ).

zwsp-start.txt vs zwsp-middle.txt
wj-start.txt vs wj-middle.txt
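
For context, a short Python sketch listing the three characters mentioned so far, using the standard `unicodedata` module. The `info` dictionary is just an illustrative structure.

```python
import unicodedata

# ZWNBSP (BOM), ZWSP, and WJ, as named in this thread.
chars = ("\ufeff", "\u200b", "\u2060")

info = {}
for ch in chars:
    info[f"U+{ord(ch):04X}"] = {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # general category from the Unicode database
        "utf8": ch.encode("utf-8").hex(" "),
    }

for code_point, fields in info.items():
    print(code_point, fields["name"], fields["category"], fields["utf8"])
```

All three are in general category Cf (format), so none of them has a visible glyph, which fits the observation that they behave alike here.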

@mcepl
Contributor

mcepl commented May 7, 2023

I still believe that the principle matters: all shenanigans with incorrectly encoded files (and yes, a file with a BOM is an incorrectly encoded one) should stay outside of vis and by definition are NOT a vis problem.

@rnpnr
Collaborator

rnpnr commented May 7, 2023

I agree with the principle, but I also don't like that the UI gets garbled by files like bom.txt. I suspect that it's a one- or two-line fix to stop that from happening. If such a patch is presented, I would see no issue with including it.

@mcepl
Contributor

mcepl commented May 7, 2023

Sure, if that is so, then I guess, “SHOW ME THE PATCH!”. Also, what should happen with the content of the file? Should the BOM just be hidden but left untouched in the file, or should it really be eliminated?

@rnpnr
Collaborator

rnpnr commented May 7, 2023

Leave it untouched like what happens when the bytes appear in the middle of the file.

I'll look into it later if I have time, but I suspect what is happening is that vis decrements the index of where the next character is supposed to be drawn by one cell too many when it's the first character in the line. Then everything is off by one for the rest of the window.
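
The expected behavior can be stated as an invariant: the columns of the visible characters should not depend on where a zero-width character sits in the line. Below is a toy Python model of that invariant. It is not vis's drawing code and does not reproduce the suspected bug; `ZERO_WIDTH` and `visible_columns` are hypothetical names, and every other character is assumed to be exactly one cell wide.

```python
# Zero-width characters named in this thread: ZWNBSP (BOM), ZWSP, WJ.
ZERO_WIDTH = frozenset("\ufeff\u200b\u2060")


def visible_columns(line: str) -> list:
    """Return (column, char) pairs for the characters that occupy a cell.

    Assumption: every character outside ZERO_WIDTH is one cell wide.
    """
    col = 0
    cells = []
    for ch in line:
        if ch in ZERO_WIDTH:
            continue  # takes no cell, whether first in the line or not
        cells.append((col, ch))
        col += 1
    return cells


plain = visible_columns("Hello")
at_start = visible_columns("\ufeffHello")
in_middle = visible_columns("H\ufeffello")
print(plain == at_start == in_middle)  # True
```

The zero-width character is skipped only for layout; the input string, and therefore the file content, is left untouched.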
