New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add hascodepoint(c::AbstractChar) and use it #54393
base: master
Are you sure you want to change the base?
Conversation
I would like more documentation of the distinction between |
Here are two examples where
None of this behavior is new in this PR, and better documentation of |
…ring of public hascodepoint
It feels a little messy that we don't allow getting the code point of an overlong encoding, but I'm sure we had a discussion about that and decided it was too dangerous. It's tempting to make this a keyword modification to |
I think the thing that bothers me is that I think of potential string data falling into three broad classes:
Because of the decision to have |
@StefanKarpinski, so one alternative would be to have I'm fine with this, though I'm not sure what the other implications would be. It doesn't seem like it would be too breaking since it would take a case that currently throws and make it non-throwing. |
If the encoding is still unique, I like normalizing in Of course, if the overlong encoding is not unique (this should never be the case, right? Any overlong |
Co-authored-by: Sukera <11753998+Seelengrab@users.noreply.github.com>
To play devil's advocate to myself, I think the worry about overlong encodings is that someone might sneak a code point that's not supposed to get through some filter and then later cause trouble. The classic example is using To try to steelman this, I'm imagining someone writing code that doesn't call |
I'm leaning towards it's fine for |
I don't know about dangerous, but that should be fixable too, no? We'll just have to make One caveat to that is that even with comparing codepoints in |
I don't like that approach, since it will mean |
We have a lot of character predicates ( |
I agree with @stevengj. Having |
My reading of that is that there would still be a distinction between whether the two strings encode the same sequence of (our) Char object which handles those distinctly or the same sequence of (Unicode) characters (converted to UInt32), which may consider them to be the same or error on some of them (and not even looking at normalizations at the Unicode level) |
Yeah, that's certainly a bad tradeoff. Let's just stick to
Do we? IIRC we currently compare by memory contents, not encoded codepoints: julia> "\xc0\x80" == "\0"
false |
@StefanKarpinski, do you have an opinion regarding my question about character predicates above? e.g. should My inclination is that they should all return How should overlong characters be displayed? Should they show the codepoint? Currently they display without showing the codepoint: julia> '\xc0\xa0'
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space) (I'm basically a bit confused about what people are supposed to do with overlong characters that justifies returning a |
I do feel that having code point predicates like
I can see a case for making
Isn't that the code point show in there in the description? |
This PR adds and exports a new function
hascodepoint(c::AbstractChar)
which returns whether or notcodepoint(c)
will succeed.It then uses this function before several calls to
codepoint(c)
orUInt32(c)
instead of!ismalformed(c)
, which was incorrect because it didn't handle theisoverlong(c)
case. Fixes #54343.(Potentially we could also have an
unsafe_codepoint(c)
function that assumeshascodepoint(c) == true
in order to optimize conversions in cases wherehascodepoint(c)
has already been checked.)