Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode: Improve Windows unicode conversion functions #8109

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Smjert
Copy link
Member

@Smjert Smjert commented Aug 7, 2023

  • Add a wstringToString overload that takes a wchar_t pointer and the maximum amount of characters to scan for a null terminator in the provided pointer/buffer, to prevent overflowing.

  • Add a to_bytes overload, similar to the new wstringToString overload which permits to avoid to construct a std::wstring from a wchar_t pointer, therefore reducing overhead caused by allocations and copies, just to be able to do the conversion to a UTF8 std::string. A difference from the wstringToString overload is that the size is actually the expected size of the input string, not the max size of the buffer.

  • Changed the to_bytes and from_bytes conversions functions to always call the respective Windows API conversion functions with the precise size of the input string, to avoid the need to scan for the null terminator. When std::wstring and std::string are passed to wstringToString and stringToWstring, they are expected to be already of the correct size, so to terminate exactly just before the null terminator. For the other overloads, wstringToString and stringToWstring will scan the provided buffers for a null terminator.

  • Reduced overallocation to prepare the biggest needed output due to conversions between UTF8 -> UTF16 and UTF16 -> UTF8. The overallocation was considering sizes given to std::string and std::wstring to be always a factor of bytes, but that's not the case because the size of the strings is in characters.

  • Added checks to ensure that the strings are not going outside of the size limits imposed by the conversion and the Windows APIs.

  • User code changed to reflect the previous implementation changes, specifically to call the overload that provides the bounded scan where possible. Done some improvements around how many times conversions were done.

  • Added some additional tests for the unicode conversion functions to test special cases.

Additional notes:

For the UTF8 -> UTF16 and UTF16 -> UTF8 conversions and size of the output string in respect to the input one,
the original sizes are incorrect because I think it was assuming that the sizes were all in bytes.

First we have to remember that in the UTF8 -> UTF16 conversion, the MultiByteToWideChar conversion function wants the input to be in bytes and in the output in UTF16 chars.
Now in this conversion the worst case/the maximum factor if we consider bytes is 2, because going from 1 byte/character ASCII in UTF8 to UTF16 requires 1 UTF16 character, which is 2 bytes.
But since the output size is counted in UTF16 characters, the conversion factor is actually 1. In all other cases going from UTF8 -> UTF16 reduces the size in bytes or remains the same.
For instance the range U+0800 to U+0FFF requires 3 bytes in UTF8, but in UTF16 it requires 1 character, so 2 bytes.

Conversely in the UTF16 -> UTF8 case, the WideCharToMultiByte conversion function wants the input to be characters and the output to be bytes.
The worst cases are then to be considered when the count of characters has to be multiplied by some number higher than 1, to match the size in UTF8. This is 3, because as we've previously seen, 1 UTF16 character may have to be translated to 3 bytes.
Although 3 bytes is not the biggest UTF8 code point, but 4, these are surrogate pairs which are represented with 4 bytes in both UTF8 and UTF16, or 2 characters in UTF16. This means that going from 2 UTF16 character representing the surrogate pair to the 4 UTF8 bytes the factor is 2.

- Add a wstringToString overload that takes a wchar_t pointer
  and the maximum amount of characters to scan for a null terminator
  in the provided pointer/buffer, to prevent overflowing.

- Add a to_bytes overload, similar to the new wstringToString overload
  which permits to avoid to construct a std::wstring from a wchar_t pointer,
  therefore reducing overhead caused by allocations and copies,
  just to be able to do the conversion to a UTF8 std::string.
  A difference from the wstringToString overload is that the size
  is actually the expected size of the input string, not the max size
  of the buffer.

- Changed the to_bytes and from_bytes conversions functions
  to always call the respective Windows API conversion functions
  with the precise size of the input string, to avoid the need
  to scan for the null terminator.
  When std::wstring and std::string are passed to wstringToString
  and stringToWstring, they are expected to be already of the
  correct size, so to terminate exactly just before the null terminator.
  For the other overloads, wstringToString and stringToWstring
  will scan the provided buffers for a null terminator.

- Reduced overallocation to prepare the biggest needed output
  due to conversions between UTF8 -> UTF16 and UTF16 -> UTF8.
  The overallocation was considering sizes given to std::string
  and std::wstring to be always a factor of bytes,
  but that's not the case because the size of the strings is in characters.

- Added checks to ensure that the strings are not going outside
  of the size limits imposed by the conversion and the Windows APIs.

- User code changed to reflect the previous implementation changes,
  specifically to call the overload that provides the bounded scan
  where possible.
  Done some improvements around how many times conversions were done.

- Added some additional tests for the unicode conversion functions
  to test special cases.
@Smjert Smjert marked this pull request as ready for review August 8, 2023 15:30
@Smjert Smjert requested review from a team as code owners August 8, 2023 15:30
@Smjert Smjert added this to the 5.11.0 milestone Oct 4, 2023
@Smjert
Copy link
Member Author

Smjert commented Nov 13, 2023

Note to the readers: I know this is a bit big, I see that's languishing and I'll attempt to see if I can split it. Though some of the changes are coupled and/or would need to be implemented slightly differently to support the intermediate state.

Ideally I would like to split the overallocation improvement from the function interface change (accepting a char* and a length to provide a maximum amount of chars to search in).

Copy link
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems mostly fine. While the PR looks large, it's mostly mechanical. One question about function signatures...

* to exist; when it's not found, the size for the string will be max_chars.
* @returns A narrow string, constructed from a wide C-string and a size
*/
std::string wstringToString(const wchar_t* src, std::size_t max_chars);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several callers seem to use std:vector for the underlying representation. And in this PR, they end up calling wstringToString(profile.data(), profile.size());. Is it worth making a std:vector wrapper?

@directionless directionless modified the milestones: 5.11.0, 5.12.0 Dec 27, 2023
@directionless
Copy link
Member

@Smjert Is this still something you want?

@Smjert
Copy link
Member Author

Smjert commented Feb 29, 2024

@Smjert Is this still something you want?

Yes! definitely.

I still have to reason a bit around adding that kind of interface (the std::vector one), because while it can be seen as easy to use, it can cause subtle bugs if one isn't careful, vs explicitly asking for the range.
Granted, even with providing the range one can still get it wrong, but there isn't a single interface that makes it usable without potential errors.

@directionless
Copy link
Member

I still have to reason a bit around adding that kind of interface (the std::vector one), because while it can be seen as easy to use, it can cause subtle bugs if one isn't careful

Maybe we shouldn't add it as a wrapper then?

@directionless directionless modified the milestones: 5.12.0, 5.13 Mar 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants