add hascodepoint(c::AbstractChar) and use it #54393

Open: wants to merge 11 commits into master
Conversation

@stevengj (Member) commented May 7, 2024

This PR adds and exports a new function hascodepoint(c::AbstractChar) which returns whether or not codepoint(c) will succeed.

It then uses this function to guard several calls to codepoint(c) or UInt32(c), replacing the previous !ismalformed(c) check, which was incorrect because it didn't handle the isoverlong(c) case. Fixes #54343.

(Potentially we could also have an unsafe_codepoint(c) function that assumes hascodepoint(c) == true in order to optimize conversions in cases where hascodepoint(c) has already been checked.)
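
For reference, a minimal sketch of the intended semantics, assuming Base.ismalformed and Base.isoverlong as the underlying checks (the exact definition and call sites in the PR may differ):

```julia
# A Char has a code point iff it is neither malformed nor overlong,
# so codepoint(c)/UInt32(c) will succeed exactly when this returns true.
hascodepoint(c::AbstractChar) = !Base.ismalformed(c) && !Base.isoverlong(c)

# Call sites that previously guarded the conversion with !ismalformed(c), e.g.
#     ismalformed(c) ? fallback : f(UInt32(c))
# would use hascodepoint(c) instead, so overlong Chars are also routed to the
# fallback rather than throwing inside UInt32(c).
```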

@stevengj added the domain:unicode (Related to unicode characters and encodings), needs tests (Unit tests are required for this change), and needs news (A NEWS entry is required for this change) labels on May 7, 2024
@jariji (Contributor) commented May 7, 2024

I would like more documentation of the distinction between isvalid and hascodepoint. What is a "valid" char if not "hascodepoint"?

@stevengj (Member, Author) commented May 7, 2024

Here are two examples where hascodepoint(c) returns true (and codepoint(c) succeeds), but for which isvalid(c) returns false:

c = '\U110000' has a codepoint 0x00110000 but is an invalid Unicode character because codepoints are required by the Unicode standard to be < 0x00110000, now and forever, in order to ensure that codepoints can be represented in UTF-16.

c = '\ud800' has a codepoint U+D800 but is an invalid Unicode character because it is reserved for use in the UTF-16 encoding as part of a surrogate pair.
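
For concreteness, both cases can be reproduced today (constructing the Chars with Char(...) rather than literals, to sidestep literal-syntax questions):

```julia
julia> c = Char(0x00110000);  # too-high code point

julia> codepoint(c), isvalid(c)
(0x00110000, false)

julia> s = Char(0x0000d800);  # unpaired-surrogate code point

julia> codepoint(s), isvalid(s)
(0x0000d800, false)
```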

None of this behavior is new in this PR, and better documentation of isvalid should be a separate issue, but I added some information on this to the hascodepoint docs.

@stevengj removed the needs tests (Unit tests are required for this change) and needs news (A NEWS entry is required for this change) labels on May 7, 2024
@StefanKarpinski (Member) commented:

It feels a little messy that we don't allow getting the code point of an overlong encoding, but I'm sure we had a discussion about that and decided it was too dangerous. It's tempting to make this a keyword modification to isvalid, but given that we already have a variety of these predicates, perhaps this one is also good to have.

@StefanKarpinski (Member) commented May 7, 2024

I think the thing that bothers me is that I think of potential string data falling into three broad classes:

  1. Valid. String data that is fully kosher: encoding is sensible and code points that are encoded are good.

  2. Well-formed but invalid. Respects the encoding structure, but what is encoded is no good. For some encodings, like Latin-1, this doesn't exist: every possible sequence of bytes is valid Latin-1. For UTF-16 this is only unpaired surrogates. For UTF-8 this includes: overlong encodings, surrogate code points (since they're reserved for use in UTF-16 and should never appear in UTF-8), and too-high code points.

  3. Malformed. Doesn't even respect the general structure of the encoding; it's just junk. For some encodings like Latin-1 or UTF-16 this doesn't exist, because any possible sequence of code units encodes a code point sequence. For UTF-8 there are a few ways this can happen: a) a byte with five or more leading 1 bits, b) an unexpected continuation byte, c) a lead byte followed by too few continuation bytes.

Because of the decision to have codepoint(c) throw an error for overlong encodings, hascodepoint(c) cuts across these classes in a messy way: it returns true for all of class 1 (valid) and false for all of class 3 (malformed), but it's a mix for class 2 (well-formed but invalid): false for overlong encodings, true for the rest. Maybe this is fine, but that's where my misgivings come from.
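
For concreteness, here is roughly how the three classes surface through the current (unexported) Base predicates for Char; a sketch, with the expected results in comments:

```julia
# Class 1: valid, a sensible encoding of a good code point.
isvalid('a')                 # true

# Class 2: well-formed but invalid, decodes to something Unicode forbids.
isvalid(Char(0xd800))        # false (surrogate code point)
Base.isoverlong('\xc0\x80')  # true  (overlong encoding of U+0000)

# Class 3: malformed, doesn't respect the encoding structure at all.
Base.ismalformed('\xff')     # true  (a stray 0xff byte is never valid UTF-8)
```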

NEWS.md (resolved review thread)
@stevengj (Member, Author) commented May 8, 2024

@StefanKarpinski, so one alternative would be to have codepoint(c) return a value for overlong encodings. In that case we can remove hascodepoint and should probably export ismalformed and document that codepoint should succeed for any non-malformed c.

I'm fine with this, though I'm not sure what the other implications would be. It doesn't seem like it would be too breaking since it would take a case that currently throws and make it non-throwing.

@stevengj added the status:DO NOT MERGE (Do not merge this PR!) label on May 8, 2024
@Seelengrab (Contributor) commented:

> one alternative would be to have codepoint(c) return a value for overlong encodings. In that case we can remove hascodepoint and should probably export ismalformed and document that codepoint should succeed for any non-malformed c.

If the encoding is still unique, I like normalizing in codepoint. It's IMO better to silently fix potentially broken code that could unambiguously do the right thing than force modification/fixes by adding hascodepoint.

Of course, if the overlong encoding is not unique (this should never be the case, right? Any overlong Char must map to one and only one codepoint, right?), we'll still have to throw either way.

Co-authored-by: Sukera <11753998+Seelengrab@users.noreply.github.com>
@StefanKarpinski (Member) commented:

To play devil's advocate to myself, I think the worry about overlong encodings is that someone might sneak a code point that isn't supposed to be there past some filter and then later cause trouble. The classic example is using '\xc0\x80' to sneak code point U+0000 into null-terminated strings. This is done on purpose in Modified UTF-8, for example.

To try to steelman this, I'm imagining someone writing code that doesn't call isvalid but examines each character's code point by calling codepoint(c), assuming that if the string is invalid, one of those calls will error. However, that would already not be the case, since codepoint('\ud800') and codepoint('\xf4\x90\x80\x80') both return values rather than erroring. Currently, the only kind of invalidity that approach would catch is overlong encodings. I'm not at all clear on why overlong encodings are uniquely dangerous. It seems fine to me to have isvalid('\xc0\x80') == false and codepoint('\xc0\x80') == 0. In that case we would simply have ismalformed as the predicate that distinguishes whether something encodes a code point or not. The only case I can see where there would be a problem is code that checks for a specific character and then later assumes that character's code point cannot occur and that assumption being false due to overlong encodings. Is that dangerous enough to really warrant making everything messier?
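
The asymmetry being described can be checked directly; a sketch, with expected results in comments rather than REPL output:

```julia
codepoint(Char(0xd800))        # 0x0000d800  (surrogate: returns a value today)
codepoint('\xf4\x90\x80\x80')  # 0x00110000  (too-high code point: returns a value today)
codepoint('\xc0\x80')          # throws      (overlong U+0000 is the odd one out)
```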

@StefanKarpinski (Member) commented:

I'm leaning towards it being fine for codepoint(overlong) to return the encoded code point instead of erroring, and then we can just export and document ismalformed and use !ismalformed instead of hascodepoint.

@Seelengrab (Contributor) commented May 8, 2024

> The only case I can see where there would be a problem is code that checks for a specific character and then later assumes that character's code point cannot occur and that assumption being false due to overlong encodings. Is that dangerous enough to really warrant making everything messier?

I don't know about dangerous, but that should be fixable too, no? We'll just have to make ==(::Char, ::Char) compare the codepoints instead of the bitwise representations, as is done now. That would make '\0' in my_str return true even if the null byte is encoded as overlong. Might be considered slightly breaking though, since people relying on that would have to switch to ===.

One caveat to that is that even with comparing codepoints in ==, if someone then looks at the raw bytes instead of the Char-iterator abstraction, they may be confused if they don't know about overlong encodings.
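
For concreteness, the observable effect of the floated change would be roughly this (a behavior sketch, not actual Base behavior):

```julia
'\xc0\x80' == '\0'   # currently false: Char == compares the bit patterns
'\0' in "\xc0\x80"   # currently false for the same reason
# under code-point-based Char ==, both would become true, and === would be
# needed to distinguish an overlong encoding from the canonical one
```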

@stevengj (Member, Author) commented May 8, 2024

> We'll just have to make ==(::Char, ::Char) compare the codepoints instead of the bitwise representations, as is done now

I don't like that approach, since it will mean string1 == string2 is no longer equivalent to all(collect(string1) .== collect(string2)). (It will also slow down character comparisons.)
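
The equivalence being appealed to can be seen with the overlong-null example from earlier in the thread (a sketch; expected values in comments):

```julia
s1, s2 = "\xc0\x80", "\0"
s1 == s2                          # false (String == compares code units)
all(collect(s1) .== collect(s2))  # false today, since Char == is also bitwise
# a code-point-based Char == would flip only the second result, breaking the equivalence
```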

@stevengj (Member, Author) commented May 8, 2024

We have a lot of character predicates (islower, category codes, etc) that call utf8proc with the code point if the character is not malformed. What should these do with overlong encodings of valid codepoints?

@StefanKarpinski (Member) commented:

I agree with @stevengj. Having regular::Char == overlong::Char seems very iffy to me. I do see the point that right now we have the property that two strings are equal if they encode the same sequence of code points.

@vtjnash (Member) commented May 8, 2024

My reading of that is that there would still be a distinction between whether the two strings encode the same sequence of (our) Char objects, which treat those encodings distinctly, or the same sequence of (Unicode) characters (converted to UInt32), which may consider them to be the same or error on some of them (and that's without even looking at normalization at the Unicode level).

@Seelengrab (Contributor) commented May 8, 2024

> it will mean string1 == string2 is no longer equivalent to all(collect(string1) .== collect(string2)). (It will also slow down character comparisons.)

Yeah, that's certainly a bad tradeoff. Let's just stick to codepoint normalizing to the non-overlong encoding, then.

> I do see the point that right now we have the property that two strings are equal if they encode the same sequence of code points.

Do we? IIRC we currently compare by memory contents, not encoded codepoints:

julia> "\xc0\x80" == "\0"
false

@stevengj (Member, Author) commented May 9, 2024

@StefanKarpinski, do you have an opinion regarding my question about character predicates above?

e.g. should isspace('\xc0\xa0') == true, since this is an overlong encoding of ' ' == '\u0020'? Currently it returns false, but some other predicates throw an exception, like islowercase('\xc0\xa0').

My inclination is that they should all return false, i.e. they should check ismalformed(c) | isoverlong(c), or simply isvalid(c) (though this does some extra checks that are redundant with utf8proc), before passing the codepoint to utf8proc.
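
A sketch of the guard being proposed, written as a higher-order helper so it stays runnable without touching the existing predicates (the predicate passed in below is just an illustration):

```julia
# Return false for anything that has no usable code point (malformed or
# overlong), and only then apply a code-point-based test.
guarded(pred, c::AbstractChar) =
    (Base.ismalformed(c) | Base.isoverlong(c)) ? false : pred(UInt32(c))

guarded(u -> u == UInt32(' '), '\xc0\xa0')  # false: overlong, never reaches the test
guarded(u -> u == UInt32(' '), ' ')         # true
```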

How should overlong characters be displayed? Should they show the codepoint? Currently they display without showing the codepoint:

julia> '\xc0\xa0'
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)

(I'm basically a bit confused about what people are supposed to do with overlong characters that justifies returning a codepoint for them.)

@StefanKarpinski (Member) commented:

I do feel that having code point predicates like isspace or islowercase return true for overlong encodings is iffy. My gut inclination leans towards throwing an error if the argument is not valid Unicode. There are two reasons in my mind for allowing codepoint(c) for well-formed but invalid characters:

  1. codepoint(c) already works for other invalid but well-formed characters like surrogates and too-high code points. What makes overlong encodings so different from these?
  2. If we don't allow codepoint(c) for well-formed but invalid characters, we should still provide some way to get the code point. Doesn't it make the most sense for that way to be codepoint(c)?

I can see a case for making codepoint(c) strict by default but having codepoint(c, strict=false) which will decode anything that is not malformed, but in that case the strict version should IMO also error on surrogates and too-high code points.
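
A hypothetical sketch of that strict/lenient split; nothing here is the actual Base implementation, and lenient_codepoint (including its hand-rolled decoder) exists only for illustration:

```julia
function lenient_codepoint(c::Char; strict::Bool=true)
    Base.ismalformed(c) && error("malformed character has no code point")
    # decode the stored UTF-8-style bytes by hand, accepting overlong forms
    u = reinterpret(UInt32, c)
    b = (u >> 24) % UInt8
    cp = UInt32(b & (0x7f >> leading_ones(b)))       # payload bits of the lead byte
    u <<= 8
    while u != 0
        cp = (cp << 6) | ((u >> 24) % UInt8 & 0x3f)  # append continuation payloads
        u <<= 8
    end
    # strict mode additionally rejects surrogates, too-high values, etc.
    strict && !isvalid(Char, cp) && error("not a valid Unicode scalar value")
    return cp
end

lenient_codepoint(' ')                       # 0x00000020
lenient_codepoint('\xc0\xa0'; strict=false)  # 0x00000020 (overlong ' ' decodes leniently)
lenient_codepoint(Char(0xd800))              # throws under strict=true (surrogate)
```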

> How should overlong characters be displayed? Should they show the codepoint? Currently they display without showing the codepoint:
>
> julia> '\xc0\xa0'
> '\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)

Isn't that the code point shown right there in the description?

Labels: domain:unicode (Related to unicode characters and encodings), status:DO NOT MERGE (Do not merge this PR!)
Projects: none yet
Development: successfully merging this pull request may close the issue "isuppercase/islowercase fail on invalid characters"
7 participants