unicode: Improve Windows unicode conversion functions #8109

Smjert · 2023-08-07T15:26:33Z

Add a wstringToString overload that takes a wchar_t pointer and the maximum amount of characters to scan for a null terminator in the provided pointer/buffer, to prevent overflowing.
Add a to_bytes overload, similar to the new wstringToString overload which permits to avoid to construct a std::wstring from a wchar_t pointer, therefore reducing overhead caused by allocations and copies, just to be able to do the conversion to a UTF8 std::string. A difference from the wstringToString overload is that the size is actually the expected size of the input string, not the max size of the buffer.
Changed the to_bytes and from_bytes conversions functions to always call the respective Windows API conversion functions with the precise size of the input string, to avoid the need to scan for the null terminator. When std::wstring and std::string are passed to wstringToString and stringToWstring, they are expected to be already of the correct size, so to terminate exactly just before the null terminator. For the other overloads, wstringToString and stringToWstring will scan the provided buffers for a null terminator.
Reduced overallocation to prepare the biggest needed output due to conversions between UTF8 -> UTF16 and UTF16 -> UTF8. The overallocation was considering sizes given to std::string and std::wstring to be always a factor of bytes, but that's not the case because the size of the strings is in characters.
Added checks to ensure that the strings are not going outside of the size limits imposed by the conversion and the Windows APIs.
User code changed to reflect the previous implementation changes, specifically to call the overload that provides the bounded scan where possible. Done some improvements around how many times conversions were done.
Added some additional tests for the unicode conversion functions to test special cases.

Additional notes:

For the UTF8 -> UTF16 and UTF16 -> UTF8 conversions and size of the output string in respect to the input one,
the original sizes are incorrect because I think it was assuming that the sizes were all in bytes.

First we have to remember that in the UTF8 -> UTF16 conversion, the MultiByteToWideChar conversion function wants the input to be in bytes and in the output in UTF16 chars.
Now in this conversion the worst case/the maximum factor if we consider bytes is 2, because going from 1 byte/character ASCII in UTF8 to UTF16 requires 1 UTF16 character, which is 2 bytes.
But since the output size is counted in UTF16 characters, the conversion factor is actually 1. In all other cases going from UTF8 -> UTF16 reduces the size in bytes or remains the same.
For instance the range U+0800 to U+0FFF requires 3 bytes in UTF8, but in UTF16 it requires 1 character, so 2 bytes.

Conversely in the UTF16 -> UTF8 case, the WideCharToMultiByte conversion function wants the input to be characters and the output to be bytes.
The worst cases are then to be considered when the count of characters has to be multiplied by some number higher than 1, to match the size in UTF8. This is 3, because as we've previously seen, 1 UTF16 character may have to be translated to 3 bytes.
Although 3 bytes is not the biggest UTF8 code point, but 4, these are surrogate pairs which are represented with 4 bytes in both UTF8 and UTF16, or 2 characters in UTF16. This means that going from 2 UTF16 character representing the surrogate pair to the 4 UTF8 bytes the factor is 2.

- Add a wstringToString overload that takes a wchar_t pointer and the maximum amount of characters to scan for a null terminator in the provided pointer/buffer, to prevent overflowing. - Add a to_bytes overload, similar to the new wstringToString overload which permits to avoid to construct a std::wstring from a wchar_t pointer, therefore reducing overhead caused by allocations and copies, just to be able to do the conversion to a UTF8 std::string. A difference from the wstringToString overload is that the size is actually the expected size of the input string, not the max size of the buffer. - Changed the to_bytes and from_bytes conversions functions to always call the respective Windows API conversion functions with the precise size of the input string, to avoid the need to scan for the null terminator. When std::wstring and std::string are passed to wstringToString and stringToWstring, they are expected to be already of the correct size, so to terminate exactly just before the null terminator. For the other overloads, wstringToString and stringToWstring will scan the provided buffers for a null terminator. - Reduced overallocation to prepare the biggest needed output due to conversions between UTF8 -> UTF16 and UTF16 -> UTF8. The overallocation was considering sizes given to std::string and std::wstring to be always a factor of bytes, but that's not the case because the size of the strings is in characters. - Added checks to ensure that the strings are not going outside of the size limits imposed by the conversion and the Windows APIs. - User code changed to reflect the previous implementation changes, specifically to call the overload that provides the bounded scan where possible. Done some improvements around how many times conversions were done. - Added some additional tests for the unicode conversion functions to test special cases.

Smjert · 2023-11-13T20:10:52Z

Note to the readers: I know this is a bit big, I see that's languishing and I'll attempt to see if I can split it. Though some of the changes are coupled and/or would need to be implemented slightly differently to support the intermediate state.

Ideally I would like to split the overallocation improvement from the function interface change (accepting a char* and a length to provide a maximum amount of chars to search in).

directionless

Seems mostly fine. While the PR looks large, it's mostly mechanical. One question about function signatures...

directionless · 2023-11-25T12:16:23Z

osquery/utils/conversions/windows/strings.h

+ * to exist; when it's not found, the size for the string will be max_chars.
+ * @returns A narrow string, constructed from a wide C-string and a size
+ */
+std::string wstringToString(const wchar_t* src, std::size_t max_chars);


Several callers seem to use std:vector for the underlying representation. And in this PR, they end up calling wstringToString(profile.data(), profile.size());. Is it worth making a std:vector wrapper?

directionless · 2024-02-29T12:25:06Z

@Smjert Is this still something you want?

Smjert · 2024-02-29T13:38:51Z

@Smjert Is this still something you want?

Yes! definitely.

I still have to reason a bit around adding that kind of interface (the std::vector one), because while it can be seen as easy to use, it can cause subtle bugs if one isn't careful, vs explicitly asking for the range.
Granted, even with providing the range one can still get it wrong, but there isn't a single interface that makes it usable without potential errors.

directionless · 2024-03-02T02:12:47Z

I still have to reason a bit around adding that kind of interface (the std::vector one), because while it can be seen as easy to use, it can cause subtle bugs if one isn't careful

Maybe we shouldn't add it as a wrapper then?

Smjert added Windows core performance labels Aug 7, 2023

Smjert marked this pull request as ready for review August 8, 2023 15:30

Smjert requested review from a team as code owners August 8, 2023 15:30

sharvilshah requested a review from marcosd4h August 15, 2023 13:03

Smjert mentioned this pull request Sep 25, 2023

file: Add Shortcut metadata parsing on Windows #8143

Merged

Smjert added this to the 5.11.0 milestone Oct 4, 2023

directionless reviewed Nov 25, 2023

View reviewed changes

directionless modified the milestones: 5.11.0, 5.12.0 Dec 27, 2023

directionless modified the milestones: 5.12.0, 5.13 Mar 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unicode: Improve Windows unicode conversion functions #8109

unicode: Improve Windows unicode conversion functions #8109

Smjert commented Aug 7, 2023 •

edited

Smjert commented Nov 13, 2023

directionless left a comment

directionless Nov 25, 2023

directionless commented Feb 29, 2024

Smjert commented Feb 29, 2024 •

edited

directionless commented Mar 2, 2024

unicode: Improve Windows unicode conversion functions #8109

Are you sure you want to change the base?

unicode: Improve Windows unicode conversion functions #8109

Conversation

Smjert commented Aug 7, 2023 • edited

Additional notes:

Smjert commented Nov 13, 2023

directionless left a comment

Choose a reason for hiding this comment

directionless Nov 25, 2023

Choose a reason for hiding this comment

directionless commented Feb 29, 2024

Smjert commented Feb 29, 2024 • edited

directionless commented Mar 2, 2024

Smjert commented Aug 7, 2023 •

edited

Smjert commented Feb 29, 2024 •

edited