Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix conversion problems with fs::path on Windows #5638

Merged
merged 4 commits into from Apr 30, 2024

Conversation

panpaul
Copy link
Contributor

@panpaul panpaul commented Apr 25, 2024

(I'm testing on MS Windows and my system language is set to Chinese)

I have observed these problems:

  1. Locales will not load when installed to a path with non-ASCII characters.
  2. Saved annotated PDF files (*.xopp) could not be reopened when the PDF file-path contains non-ASCII characters.
  3. If there are non-ASCII characters in the "default saved name" template, the name generated in the saving dialog will be garbled.

These problems are mainly caused by the conversion between UTF-16 and UTF-8, fixed by using .u8string() and fs::u8path()

@bhennion bhennion linked an issue Apr 25, 2024 that may be closed by this pull request
@bhennion

This comment was marked as off-topic.

@bhennion bhennion linked an issue Apr 25, 2024 that may be closed by this pull request
@Febbe

This comment has been minimized.

@@ -85,7 +85,7 @@ void initCAndCoutLocales() {
void initLocalisation() {
#ifdef ENABLE_NLS
fs::path localeDir = Util::getGettextFilepath(Util::getLocalePath().u8string().c_str());
bindtextdomain(GETTEXT_PACKAGE, localeDir.u8string().c_str());
bindtextdomain(GETTEXT_PACKAGE, localeDir.string().c_str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How you know that bindtextdomain requires system narrow encoded strings instead of utf-8?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

microsoft/vcpkg#9854

We must be careful here, vcpkg fixed this for gettext.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How you know that bindtextdomain requires system narrow encoded strings instead of utf-8?

I took a quick look at the source code of gettext, and it seems that gettext will open those message files by calling _open or _wopen. At least on MSVC, these functions do not accept UTF-8.

Maybe we should use wbindtextdomain, as suggested by the documentation.

microsoft/vcpkg#9854

We must be careful here, vcpkg fixed this for gettext.

I wasn't even aware of this patch. Thanks for pointing this out. I'll check it later.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use wbindtextdomain, as suggested by the documentation.

I think we are with that on the secure side. The patch does not seem to be applied upstream, only in vcpkg.

@Febbe
Copy link
Collaborator

Febbe commented Apr 25, 2024

I have the bad feeling, that this patch will break old xopp files with paths containing non ASCII characters, since they stored paths as ANSI/system encoded. Now those paths are assumed to be encoded in utf-8.

@Febbe

This comment was marked as off-topic.

src/util/PathUtil.cpp Outdated Show resolved Hide resolved
src/core/model/Document.cpp Outdated Show resolved Hide resolved
@panpaul
Copy link
Contributor Author

panpaul commented Apr 26, 2024

I have the bad feeling, that this patch will break old xopp files with paths containing non ASCII characters, since they stored paths as ANSI/system encoded. Now those paths are assumed to be encoded in utf-8.

I made two modifications to xopp file scheme

  1. For the PDF filename: encode the filename in UTF-8
  2. For the background image filename: encode and decode the filename in UTF-8

In the original LoadHandler: pdf filename is parsed as UTF-8 string here, so I modified SaveHandler to align with this behaviour.

Regarding the background image filename, yes, it may break compatibility with old xopp file. I'll revert those background image related changes to maintain compatibility.

@Febbe
Copy link
Collaborator

Febbe commented Apr 26, 2024

In the original LoadHandler: pdf filename is parsed as UTF-8 string here, so I modified SaveHandler to align with this behaviour.

Loading of the system file encoding would be better, since old files are already stored with it.

Regarding the background image filename, yes, it may break compatibility with old xopp file. I'll revert those background image related changes to maintain compatibility.

Maybe, we can find a solution to use utf-8 strings in general, e.g. by advancing the file version. We would like to have them in utf8 anyway. That would improve portability of xopp files, and keep old files compatible.

@Febbe
Copy link
Collaborator

Febbe commented Apr 26, 2024

@panpaul you can increase the DEV_FILE_FORMAT_VERSION in

set(DEV_FILE_FORMAT_VERSION 4 CACHE STRING "File format version" FORCE)
from 4 to 5 and

  • patch the LoadHandler to use fs::path for version less than 5
  • patch the Load- and SaveHandler to use u8string for versions greater 4

This would be the best of both versions.

src/core/control/xojfile/LoadHandler.cpp Show resolved Hide resolved
@@ -472,7 +490,7 @@ void LoadHandler::parseBgPdf() {
attachToDocument = true;
// Handle old format separately
if (this->isGzFile) {
pdfFilename = (fs::path{xournalFilepath} += ".") += pdfFilename;
pdfFilename = (xournalFilepath += ".") += pdfFilename;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, the copy was intentional here, since it is a field of LoadHandler.
Changing this might be ok in this scope, since the value is not used after this line, but might cause a bug, when this will change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you are right. I shouldn't modify this. I will revert it.

@Febbe
Copy link
Collaborator

Febbe commented Apr 26, 2024

@bhennion what do you think of that solution?

@bhennion
Copy link
Contributor

I sounds good to me. I'd add a test unit for version 5 files, and conversion from version 4 to 5 though.

@Febbe
Copy link
Collaborator

Febbe commented Apr 28, 2024

bhennion commented Apr 26, 2024

It sounds good to me. I'd add a test unit for version 5 files, and conversion from version 4 to 5 though.

@panpaul are you willing to do this, or is that too much effort for you?

@Febbe Febbe added merge proposed Merge was proposed by maintainer file format Requires changes to the file format export Issues related to export labels Apr 28, 2024
@panpaul
Copy link
Contributor Author

panpaul commented Apr 29, 2024

@panpaul are you willing to do this, or is that too much effort for you?

Sorry, I was busy in the past few days. I'll add the related tests in the next few days.

@panpaul
Copy link
Contributor Author

panpaul commented Apr 29, 2024

I sounds good to me. I'd add a test unit for version 5 files,

I have some new findings while writing tests. I'm using a Windows ARM64 device, and it's the first time I've tried the mingw64 build. 😢

opened with file created with can open
v5 C v5 M yes
v5 C v4 C no [1]
v5 C v4 M no [2]
v4 C v4 C no [1]
v4 C v4 M yes
v5 M v5 C yes
v5 M v4 M yes
v5 M v4 C no [1]
v4 M v4 M yes
v4 M v4 C no [1]

v5 refers to builds from this PR (xopp file format version 5), while v4 refers to builds downloaded from the GitHub releases page (v1.2.3)

C represents the mingw-w64-clang-aarch64- toolchain, and M represents the mingw-w64-x86_64- toolchain

[1]: v4 C will store garbled characters, and g_markup_parse_context_parse will return an error when non-UTF-8 chacraters are found
[2]: loading UTF-8 string with fs::path will cause problem

In short, libc++ and libstdc++ have different behaviors with fs::path, I guess that's what unspecified means. (Though I haven't checked the source code)

Otherwise, if path::value_type is wchar_t, conversion, if any, is unspecified. This is the case on Windows, where wchar_t is 16 bit and the native encoding is UTF-16.

In libstdc++, both .string() and .u8string() will return a UTF-8 string, while libc++ will treat them differently.

and conversion from version 4 to 5 though.

The conversion should be automatic if the file can be opened, as we have dedicated 'load' and 'save' logic.

Based on the table above, the filepath stored in the xopp file might already be encoded in UTF-8 if using mingw64, which we want to ensure in v5.

Should we add a logic to check whether the string is already encoded with UTF-8 and load them with fs::path or fs::u8path? If so, I think we may not have to upgrade the file format version to 5. In addition, xournalpp officially only has the mingw64 build :)

Notes on how the test files are create:

  1. v4.* are created with binaries compiled from release-1.2
  2. *.mingw.* are created with the binaries built with mingw-w64-x86_64- toolchain while others are built with mingw-w64-clang-aarch64- toolchain
  3. I create the raw files with Open -> Annotate PDF (without checking attach)
  4. files with *.unzipped.* are unzipped from the raw files, and the filename field is modified to relative path
  5. files without unzipped are created with gzip {*.unzipped.*}
  6. 测试.pdf is copied from packaged_xopp/pdfBackground/old.xopp.bg.pdf

@Febbe
Copy link
Collaborator

Febbe commented Apr 29, 2024

Regarding [1], when is g_markup_parse_context_parse called?
Clang:std::format::path(std::format::path::string) should always create a valid path, even when the characters are interpreted as garbled with utf8.
Is that a bug in clang, or do we store the string formatted with any locale?

If so, I think we may not have to upgrade the file format version to 5

I think, since we ever only supported mingw, we can really change .string() to u8string() and path{...} to u8path(...) without any version bumb or utf-8 detection.
This would be a fix for clang and msvc custom builds only.

@Febbe
Copy link
Collaborator

Febbe commented Apr 29, 2024

Regarding [1], when is g_markup_parse_context_parse called?
Clang:std::format::path(std::format::path::string) should always create a valid path, even when the characters are interpreted as garbled with utf8.
Is that a bug in clang, or do we store the string formatted with any locale?

If so, I think we may not have to upgrade the file format version to 5

I think, since we ever only supported mingw, we can really change .string() to u8string() and path{...} to u8path(...) without any version bumb or utf-8 detection.
This would be a fix for clang and msvc custom builds only.

Test files, that test non western utf8 symbols would still be good.

@panpaul
Copy link
Contributor Author

panpaul commented Apr 29, 2024

Regarding [1], when is g_markup_parse_context_parse called?

do {
char buffer[1024];
len = readContentFile(buffer, sizeof(buffer));
if (len > 0) {
valid = g_markup_parse_context_parse(context, buffer, len, &error);
}
if (error) {
g_warning("LoadHandler::parseXml: %s\n", error->message);
valid = false;
break;
}
} while (len >= 0 && valid && !error);

Clang:std::format::path(std::format::path::string) should always create a valid path, even when the characters are interpreted as garbled with utf8.
Is that a bug in clang, or do we store the string formatted with any locale?

std::format::path ? I guess you want to say std::filesystem::path(std::string)

To be honest, I don't quite understand the constructor part...

If so, I think we may not have to upgrade the file format version to 5

I think, since we ever only supported mingw, we can really change .string() to u8string() and path{...} to u8path(...) without any version bumb or utf-8 detection.
This would be a fix for clang and msvc custom builds only.

Test files, that test non western utf8 symbols would still be good.

Sorry, I didn't catch you. So should I close this PR and wait for #5641 (and add tests for non western utf8 symbol in #5641)

@Febbe
Copy link
Collaborator

Febbe commented Apr 29, 2024

std::format::path ? I guess you want to say std::filesystem::path(std::string)

Yes

To be honest, I don't quite understand the constructor part...

It's just an invariant I assume. Constructing a path from the narrow string representation of another path via the constructor for narrow strings should result in an equal path object.

Sorry, I didn't catch you. So should I close this PR and wait for #5641 (and add tests for non western utf8 symbol in #5641)

No, just return to your first version without the version bump. Exporting paths with a clang/msvc compiled binary should also be correct.
So changing string to u8string is good
And changing the path construction from fs::path{...} to fs::u8path(...) is really mandatory.
People using our official installation won't have any negative impact.

@Febbe Febbe removed the file format Requires changes to the file format label Apr 30, 2024
@Febbe Febbe merged commit 87cade8 into xournalpp:release-1.2 Apr 30, 2024
5 checks passed
@Febbe
Copy link
Collaborator

Febbe commented Apr 30, 2024

Thank you @panpaul!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
export Issues related to export merge proposed Merge was proposed by maintainer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing user interface file xournalpp-1.2.3-windows failed to set gui language to zh
4 participants