New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unicode: Improve Windows unicode conversion functions #8109
base: master
Are you sure you want to change the base?
unicode: Improve Windows unicode conversion functions #8109
Conversation
- Add a wstringToString overload that takes a wchar_t pointer and the maximum amount of characters to scan for a null terminator in the provided pointer/buffer, to prevent overflowing. - Add a to_bytes overload, similar to the new wstringToString overload which permits to avoid to construct a std::wstring from a wchar_t pointer, therefore reducing overhead caused by allocations and copies, just to be able to do the conversion to a UTF8 std::string. A difference from the wstringToString overload is that the size is actually the expected size of the input string, not the max size of the buffer. - Changed the to_bytes and from_bytes conversions functions to always call the respective Windows API conversion functions with the precise size of the input string, to avoid the need to scan for the null terminator. When std::wstring and std::string are passed to wstringToString and stringToWstring, they are expected to be already of the correct size, so to terminate exactly just before the null terminator. For the other overloads, wstringToString and stringToWstring will scan the provided buffers for a null terminator. - Reduced overallocation to prepare the biggest needed output due to conversions between UTF8 -> UTF16 and UTF16 -> UTF8. The overallocation was considering sizes given to std::string and std::wstring to be always a factor of bytes, but that's not the case because the size of the strings is in characters. - Added checks to ensure that the strings are not going outside of the size limits imposed by the conversion and the Windows APIs. - User code changed to reflect the previous implementation changes, specifically to call the overload that provides the bounded scan where possible. Done some improvements around how many times conversions were done. - Added some additional tests for the unicode conversion functions to test special cases.
Note to the readers: I know this is a bit big, I see that's languishing and I'll attempt to see if I can split it. Though some of the changes are coupled and/or would need to be implemented slightly differently to support the intermediate state. Ideally I would like to split the overallocation improvement from the function interface change (accepting a char* and a length to provide a maximum amount of chars to search in). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems mostly fine. While the PR looks large, it's mostly mechanical. One question about function signatures...
* to exist; when it's not found, the size for the string will be max_chars. | ||
* @returns A narrow string, constructed from a wide C-string and a size | ||
*/ | ||
std::string wstringToString(const wchar_t* src, std::size_t max_chars); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several callers seem to use std:vector
for the underlying representation. And in this PR, they end up calling wstringToString(profile.data(), profile.size());
. Is it worth making a std:vector
wrapper?
@Smjert Is this still something you want? |
Yes! definitely. I still have to reason a bit around adding that kind of interface (the |
Maybe we shouldn't add it as a wrapper then? |
Add a wstringToString overload that takes a wchar_t pointer and the maximum amount of characters to scan for a null terminator in the provided pointer/buffer, to prevent overflowing.
Add a to_bytes overload, similar to the new wstringToString overload which permits to avoid to construct a std::wstring from a wchar_t pointer, therefore reducing overhead caused by allocations and copies, just to be able to do the conversion to a UTF8 std::string. A difference from the wstringToString overload is that the size is actually the expected size of the input string, not the max size of the buffer.
Changed the to_bytes and from_bytes conversions functions to always call the respective Windows API conversion functions with the precise size of the input string, to avoid the need to scan for the null terminator. When std::wstring and std::string are passed to wstringToString and stringToWstring, they are expected to be already of the correct size, so to terminate exactly just before the null terminator. For the other overloads, wstringToString and stringToWstring will scan the provided buffers for a null terminator.
Reduced overallocation to prepare the biggest needed output due to conversions between UTF8 -> UTF16 and UTF16 -> UTF8. The overallocation was considering sizes given to std::string and std::wstring to be always a factor of bytes, but that's not the case because the size of the strings is in characters.
Added checks to ensure that the strings are not going outside of the size limits imposed by the conversion and the Windows APIs.
User code changed to reflect the previous implementation changes, specifically to call the overload that provides the bounded scan where possible. Done some improvements around how many times conversions were done.
Added some additional tests for the unicode conversion functions to test special cases.
Additional notes:
For the UTF8 -> UTF16 and UTF16 -> UTF8 conversions and size of the output string in respect to the input one,
the original sizes are incorrect because I think it was assuming that the sizes were all in bytes.
First we have to remember that in the UTF8 -> UTF16 conversion, the
MultiByteToWideChar
conversion function wants the input to be in bytes and in the output in UTF16 chars.Now in this conversion the worst case/the maximum factor if we consider bytes is 2, because going from 1 byte/character ASCII in UTF8 to UTF16 requires 1 UTF16 character, which is 2 bytes.
But since the output size is counted in UTF16 characters, the conversion factor is actually 1. In all other cases going from UTF8 -> UTF16 reduces the size in bytes or remains the same.
For instance the range U+0800 to U+0FFF requires 3 bytes in UTF8, but in UTF16 it requires 1 character, so 2 bytes.
Conversely in the UTF16 -> UTF8 case, the
WideCharToMultiByte
conversion function wants the input to be characters and the output to be bytes.The worst cases are then to be considered when the count of characters has to be multiplied by some number higher than 1, to match the size in UTF8. This is 3, because as we've previously seen, 1 UTF16 character may have to be translated to 3 bytes.
Although 3 bytes is not the biggest UTF8 code point, but 4, these are surrogate pairs which are represented with 4 bytes in both UTF8 and UTF16, or 2 characters in UTF16. This means that going from 2 UTF16 character representing the surrogate pair to the 4 UTF8 bytes the factor is 2.