Meta.parse(repr(Char(0x110000))) fails #54396

Open
stevengj opened this issue May 8, 2024 · 3 comments
Labels
domain:display and printing Aesthetics and correctness of printed representations of objects. domain:unicode Related to unicode characters and encodings parser Language parsing and surface syntax

Comments

@stevengj
Member

stevengj commented May 8, 2024

Meta.parse(repr(Char(0x110000))) fails because

julia> show(Char(0x110000))
'\U110000'

but '\U110000' is not parseable:

julia> '\U110000'
ERROR: ParseError:
# Error @ REPL[17]:1:2
'\U110000'
#└──────┘ ── invalid unicode escape sequence

isvalid(Char(0x110000)) is false, but other invalid characters are parsed okay:

julia> '\ud800'
'\ud800': Unicode U+D800 (category Cs: Other, surrogate)

julia> isvalid('\ud800')
false

so this seems kind of inconsistent.

Options are either (a) change the printing of Char(0x110000) or (b) change the parsing to allow this. I lean towards (a). Thoughts?
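A minimal sketch for checking the round trip over a few characters, including the invalid ones above. The helper name `roundtrips` is hypothetical (it is not part of Base), and the result for `Char(0x110000)` depends on whether the Julia version in use still has this behavior:

```julia
# Hypothetical helper, not part of Base: true if repr(c) parses back to the same Char.
function roundtrips(c::Char)
    try
        return Meta.parse(repr(c)) === c
    catch
        # Meta.parse throws on a syntax error, which is the failure reported here.
        return false
    end
end

# Compare a valid character, a surrogate, the largest code point, and 0x110000.
for c in ('a', '\ud800', Char(0x10ffff), Char(0x110000))
    println(repr(c), "  isvalid=", isvalid(c), "  roundtrips=", roundtrips(c))
end
```

Whichever of (a) or (b) is chosen, a check like this should report `true` for every `Char` that can be constructed.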

@Seelengrab
Contributor

Seelengrab commented May 8, 2024

I think this is a bug in the parser. What would the printing be changed to so that it parses? Just using \u doesn't work, because then the literal is too large:

julia> '\u11000'
ERROR: ParseError:
# Error @ REPL[27]:1:2
'\u11000'
#└─────┘ ── character literal contains multiple characters
Stacktrace:
 [1] top-level scope
   @ REPL:1

@stevengj
Member Author

stevengj commented May 8, 2024

The printing could be changed to '\xf4\x90\x80\x80', by calling Base.show_invalid, for example. ('\U110000' is a lot more understandable, but is meaningless from the perspective of Unicode.)

It could also print as Char(0x110000), but that's a pretty radical change from how other characters are printed.

If we extend the parser to allow this, I guess we would parse up to '\U1fffff', since Char(0x200000) throws an error. That seems reasonable to me, since there is still a clear upper bound on what we should parse.
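To make the two bounds concrete, here is a short sketch that inspects `Char(0x110000)` and the constructor limit mentioned above. It uses only `codeunits`, `string`, `isvalid`, and the `Char` constructor; the variable names are illustrative, and the exact error type thrown for `0x200000` is deliberately not assumed:

```julia
c = Char(0x110000)

# Bytes written when this Char is converted to a String; the comment above
# expects the UTF-8-style sequence f4 90 80 80.
bytes = collect(codeunits(string(c)))
println("bytes:   ", bytes)
println("isvalid: ", isvalid(c))

# Char(0x1fffff) can still be constructed, while Char(0x200000) throws,
# so 0x1fffff is the natural upper bound for a \U escape.
largest = Char(0x1fffff)
threw = try
    Char(0x200000)
    false
catch
    true
end
println("Char(0x200000) throws: ", threw)
```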

@Seelengrab
Contributor

The manual has that exact value as an example, and documents that \U accepts up to 8 hexadecimal digits, so I'd be in favor of fixing the parser.
