Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor LoadHandler #5554

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

Conversation

tmoerschell
Copy link
Contributor

This is a partial rewrite of the load handler, as had been suggested some long time ago should be necessary in #937.

This would close #5183. There are also many smaller issues with memory safety that are addressed with this. Ideally, it should be resilient against any ill-formed files.

I tried to split the one monster class into two smaller ones:

  • XmlParser reads the XML and parses all the data necessary for loading the file. It only cares about what that data says when it is relevant for parsing.
  • LoadHandler receives the data parsed by XmlParser and builds the document.

Commit structure:

07a9247 is a preparation for abstracting away where the XML is being read from.
4dfb59b is the addition of the whole XmlParser class. The load handler loads the file twice, once with the old interface, and once with the new one. A time measurement is printed so you can try to compare performance for yourself.
fdb2828 is the move away from streams for parsing stroke points and pressures.
745a4a0 is the cleanup of the old interface, and of the LoadHandler class in general.
e31a102 addresses the fact that alpha values were not read in the old parser (no visual difference: alpha values are unused in the file as I understand it), and that some tests were expecting that behavior.

Error handling:

This part is not quite finished up yet, but I wanted to get some feedback first:

  • Fatal errors are handled using exceptions: the load process is aborted (because it cannot reasonably continue), and the error is reported to the user. Language: user's language.
  • Recoverable errors, or other small issues (e.g. missing attribute, negative pressure values) are printed as a warning on the console and the missing value is set to some default. Language: English.
  • Then there is some in-between zone of errors that you'd like to report to the user, but that shouldn't necessarily block loading the document. And we should also print them on the console, should it be relevant for a crash log or the like. My solution for this is LoadHandler::logError() which will print a warning right away and store the message for it to be retrieved when loading finishes. This should be extended to the XmlParser class as well if it fits. It currently contains a lot of asserts that can be triggered by ill-formed files. Language: both would be nice (user language in the GUI and English in the console), but I couldn't figure out a way to do that nicely, so for now it's all English.

Performance:

The new XML Parser is somewhat more wordy, performs more checks, and generally copies data coming from a C API into strings or similar. It is consequently slower on some files, especially small ones. However, I believe the old code was copying Base64 encoded data more than necessary, as this new parser is clearly faster as soon as the file includes images. Now at first performance was actually terrible, and then I figured out that for parsing doubles, the stream operator>> was about twice as slow as g_ascii_strtod(). This seemed like a sufficiently compelling reason to use the latter.

Other notes:

In theory, the separation between parser and loader would make it possible to use parallel processing (parser continues reading while the loader builds the document), but I think (I tried) that'd be more trouble than it's worth.

The parser uses libxml2, which we already rely upon for the settings XML.

I know that the XmlParser::process*Node() functions are all very similar, but nevertheless, there are some small differences which kept me from writing a general form.

The *InputStream::open() functions are not used, should I make the minimal viable class, or is this fine in order to allow easy reuse?

How far along is the zip file format / Where is it going to get? Is it still flexible to changes? For example attachments for images and TEX images would be so much more straightforward to parse if they were given in an attribute.

I tried to provide some kind of forward compatibility (ignoring unknown nodes and such), but I'm not sure how effective that would really be.

Finally, this is intended as a (small) stepping stone for the new file format, which will bring a completely new parser. Ideally, it should be possible to separate zip and gzip file formats a bit more. I'm certainly willing to continue into that direction if considered useful.

@tmoerschell tmoerschell force-pushed the xml-parser branch 2 times, most recently from df6f332 to c0c358b Compare March 11, 2024 13:53
const auto attributeMap = getAttributeMap();

std::string creator;
auto optVersion = XmlParserHelper::getAttrib<std::string>("verison", attributeMap);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto optVersion = XmlParserHelper::getAttrib<std::string>("verison", attributeMap);
auto optVersion = XmlParserHelper::getAttrib<std::string>("version", attributeMap);

To be safe, all those strings should be defined as constexpr in some header(s), and used both here and in the SaveHandler. It's the only way to ensure there is no typo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've put all the attribute names in a file XmlNames.h and a namespace XmlAttrs. Two questions for you:

  1. Should I make it more structured, that is, define XmlAttrs::Page::WIDTH_STR and XmlAttrs::Stroke::WIDTH_STR instead of just XmlAttrs::WIDTH_STR?
  2. I should probably do the same for tag names right?

edit: append "_STR", since there was a collision on the macOS build

@bhennion
Copy link
Contributor

I've only given it a short glance. This needs more time to evaluate.

About this:

Now at first performance was actually terrible, and then I figured out that for parsing doubles, the stream operator>> was about twice as slow as g_ascii_strtod(). This seemed like a sufficiently compelling reason to use the latter.

Are you sure? It may be slower with CMAKE_BUILD_TYPE = Debug, but not with Release or RelWithDebInfo. Compiler optimisation is more crucial with C++'s advanced features.

@tmoerschell
Copy link
Contributor Author

Take your time, this is a huge amount of code for you to review.

With a release build, operator>> is maybe not twice as slow anymore, but it does still make a significant difference. Here's a table of a few runs:

new time / old time test1.xoj big-test.xopp some-notes-1.xopp some-notes-2.xopp
operator>> 123 / 33 374612 / 220132 192611 / 121588 230333 / 148344
g_ascii_strtod 153 / 46 232082 / 222047 135932 / 123390 177994 / 148399

Copy link
Contributor

@bhennion bhennion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another partial review.
Overall, the code is pretty clean and easy to read, that good!
Some functions may benefit from a comment.
e.g. XmlParser::getAttributeMap(), explaining that this only reads the current node. Also, is there a way to assert that a node is only read once?

Another question: are you able to pinpoint places where a C to C++ string copy adds in complexity: meaning that the string is later only used as a const char* or so?
If so, in such places, you could use xoj::util::OwnedCString to avoid copying while maintaining mem-safety.

You may also gain a tiny bit by using xoj::util::EnumIndexedArray instead of unordered maps (e.g. in XmlParser::tagTypeToName()). Not sure you'd save a lot but maybe a little.

Ideally, having a benchmark function by function (when a similar function existed before this PR, and with -DCMAKE_BUILD_TYPE=Release) could help finding optimization possibilities.
Maybe you could add a disabled test unit which does some benchmarking, so it can be used again later, when someone else touches the LoadHandler. For a way to do so, you can look at this (unmerged) commit.

Finally: have you looked at the TestLoadHandler test unit? Does it look complete enough to catch any possible issues?

This is a very critical operation, so we must make sure nothing is broken, and we must optimize it quite a bit. I'm a bit scared of the loss of performances.

src/core/control/xojfile/GzInputStream.cpp Show resolved Hide resolved
src/core/control/xojfile/XmlNames.h Outdated Show resolved Hide resolved
src/core/model/Stroke.cpp Outdated Show resolved Hide resolved
src/core/control/xojfile/ZipInputStream.cpp Show resolved Hide resolved
src/util/include/util/Util.h Outdated Show resolved Hide resolved
std::string decodeBase64(const char* base64data);

// custom types for easier parsing using stream operators
enum class Domain { ABOSLUTE, ATTACH, CLONE };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
enum class Domain { ABOSLUTE, ATTACH, CLONE };
enum class Domain { ABSOLUTE, ATTACH, CLONE };

auto operator<<(std::ostream& stream, const XmlParserHelper::Domain domain) -> std::ostream& {
switch (domain) {
case XmlParserHelper::Domain::ABOSLUTE:
stream << "absolute";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as well, those strings should be constexpr defined in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These strings are also needed in the SaveHandler. But is it worth it to make a whole "AttachmentDomain" class in a separate file ? Where else would it fit ?

@tmoerschell
Copy link
Contributor Author

About strings / C-strings:
I am using xmlTextReaderConstName() and xmlTextReaderConstValue() instead of their counterparts without "Const", so I don't even own them (but they are reset when performing the next read operation). I'll look if there are places where we don't need a copy at all, but that seems unlikely. I was wondering if I should try to use move semantics more, but I don't think it's useful for "small" strings. I might try it out once I have proper benchmarking.

About unordered maps:
Is there any particular reason why EnumIndexedArray's inherited array is not public? If it was, I could write the tag names as a constexpr EnumIndexedArray. Or we would need to at least forward the correct constructor.
For tagNameToType() an unordered map seemed like the most idiomatic way to do it, but strcmp or even a switch case may be faster.

About the test unit:
I have used it for testing, and even found that it had been written to accommodate a small bug from the old LoadHandler (see 24a751f). However I haven't looked at it closely enough to tell if it really tests everything.
For what it's worth, I have also been using the new parser for my everyday usage (I like it risky), and have found no issues.

About benchmarking:
I'll try to whip something up. But generating valid random valid file content seems a little complicated if it's just for that. Is it ok if I just use the existing test files ? It probably would be nice to have larger or more representative samples.

@bhennion
Copy link
Contributor

bhennion commented Mar 25, 2024

About unordered maps:
Is there any particular reason why EnumIndexedArray's inherited array is not public? If it was, I could write the tag names as a constexpr EnumIndexedArray. Or we would need to at least forward the correct constructor.

We don't want to expose operator[](size_t) for instance. Forwarding the constructor is good, I think.

For tagNameToType() an unordered map seemed like the most idiomatic way to do it, but strcmp or even a switch case may be faster.

Yeah, the set of values is pretty small anyway.

About benchmarking:
I'll try to whip something up. But generating valid random valid file content seems a little complicated if it's just for that. Is it ok if I just use the existing test files ? It probably would be nice to have larger or more representative samples.

Random documents would not even remotely model use cases, and developping them would cost a lot of time. Typical documents are fine. E.g. one very long with only strokes, one with a lot of images, one with a long pdf and some annotations on top, and so on. The important thing is to be able to evaluate each kind of data more or less separately.

Rename XmlNames.h to XmlAttrs.h
Move TagType enum to new XmlTags namespace and use it to index the constexpr names
Add forwarding constructor to EnumIndexedArray
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SegFault when opening a corrupt file (produced by Xournal++ webapp)
3 participants