Refactor LoadHandler #5554

tmoerschell · 2024-03-11T02:01:25Z

This is a partial rewrite of the load handler, as had been suggested some long time ago should be necessary in #937.

This would close #5183. There are also many smaller issues with memory safety that are addressed with this. Ideally, it should be resilient against any ill-formed files.

I tried to split the one monster class into two smaller ones:

XmlParser reads the XML and parses all the data necessary for loading the file. It only cares about what that data says when it is relevant for parsing.
LoadHandler receives the data parsed by XmlParser and builds the document.

Commit structure:

07a9247 is a preparation for abstracting away where the XML is being read from.
4dfb59b is the addition of the whole XmlParser class. The load handler loads the file twice, once with the old interface, and once with the new one. A time measurement is printed so you can try to compare performance for yourself.
fdb2828 is the move away from streams for parsing stroke points and pressures.
745a4a0 is the cleanup of the old interface, and of the LoadHandler class in general.
e31a102 addresses the fact that alpha values were not read in the old parser (no visual difference: alpha values are unused in the file as I understand it), and that some tests were expecting that behavior.

Error handling:

This part is not quite finished up yet, but I wanted to get some feedback first:

Fatal errors are handled using exceptions: the load process is aborted (because it cannot reasonably continue), and the error is reported to the user. Language: user's language.
Recoverable errors, or other small issues (e.g. missing attribute, negative pressure values) are printed as a warning on the console and the missing value is set to some default. Language: English.
Then there is some in-between zone of errors that you'd like to report to the user, but that shouldn't necessarily block loading the document. And we should also print them on the console, should it be relevant for a crash log or the like. My solution for this is LoadHandler::logError() which will print a warning right away and store the message for it to be retrieved when loading finishes. This should be extended to the XmlParser class as well if it fits. It currently contains a lot of asserts that can be triggered by ill-formed files. Language: both would be nice (user language in the GUI and English in the console), but I couldn't figure out a way to do that nicely, so for now it's all English.

Performance:

The new XML Parser is somewhat more wordy, performs more checks, and generally copies data coming from a C API into strings or similar. It is consequently slower on some files, especially small ones. However, I believe the old code was copying Base64 encoded data more than necessary, as this new parser is clearly faster as soon as the file includes images. Now at first performance was actually terrible, and then I figured out that for parsing doubles, the stream operator>> was about twice as slow as g_ascii_strtod(). This seemed like a sufficiently compelling reason to use the latter.

Other notes:

In theory, the separation between parser and loader would make it possible to use parallel processing (parser continues reading while the loader builds the document), but I think (I tried) that'd be more trouble than it's worth.

The parser uses libxml2, which we already rely upon for the settings XML.

I know that the XmlParser::process*Node() functions are all very similar, but nevertheless, there are some small differences which kept me from writing a general form.

The *InputStream::open() functions are not used, should I make the minimal viable class, or is this fine in order to allow easy reuse?

How far along is the zip file format / Where is it going to get? Is it still flexible to changes? For example attachments for images and TEX images would be so much more straightforward to parse if they were given in an attribute.

I tried to provide some kind of forward compatibility (ignoring unknown nodes and such), but I'm not sure how effective that would really be.

Finally, this is intended as a (small) stepping stone for the new file format, which will bring a completely new parser. Ideally, it should be possible to separate zip and gzip file formats a bit more. I'm certainly willing to continue into that direction if considered useful.

bhennion · 2024-03-11T16:49:30Z

src/core/control/xojfile/XmlParser.cpp

+    const auto attributeMap = getAttributeMap();
+
+    std::string creator;
+    auto optVersion = XmlParserHelper::getAttrib<std::string>("verison", attributeMap);


Suggested change

auto optVersion = XmlParserHelper::getAttrib<std::string>("verison", attributeMap);

auto optVersion = XmlParserHelper::getAttrib<std::string>("version", attributeMap);

To be safe, all those strings should be defined as constexpr in some header(s), and used both here and in the SaveHandler. It's the only way to ensure there is no typo.

Ok, I've put all the attribute names in a file XmlNames.h and a namespace XmlAttrs. Two questions for you:

Should I make it more structured, that is, define XmlAttrs::Page::WIDTH_STR and XmlAttrs::Stroke::WIDTH_STR instead of just XmlAttrs::WIDTH_STR?

I should probably do the same for tag names right?

edit: append "_STR", since there was a collision on the macOS build

bhennion · 2024-03-11T16:54:58Z

I've only given it a short glance. This needs more time to evaluate.

About this:

Now at first performance was actually terrible, and then I figured out that for parsing doubles, the stream operator>> was about twice as slow as g_ascii_strtod(). This seemed like a sufficiently compelling reason to use the latter.

Are you sure? It may be slower with CMAKE_BUILD_TYPE = Debug, but not with Release or RelWithDebInfo. Compiler optimisation is more crucial with C++'s advanced features.

tmoerschell · 2024-03-11T22:46:34Z

Take your time, this is a huge amount of code for you to review.

With a release build, operator>> is maybe not twice as slow anymore, but it does still make a significant difference. Here's a table of a few runs:

new time / old time	test1.xoj	big-test.xopp	some-notes-1.xopp	some-notes-2.xopp
`operator>>`	123 / 33	374612 / 220132	192611 / 121588	230333 / 148344
`g_ascii_strtod`	153 / 46	232082 / 222047	135932 / 123390	177994 / 148399

bhennion

Here is another partial review.
Overall, the code is pretty clean and easy to read, that good!
Some functions may benefit from a comment.
e.g. XmlParser::getAttributeMap(), explaining that this only reads the current node. Also, is there a way to assert that a node is only read once?

Another question: are you able to pinpoint places where a C to C++ string copy adds in complexity: meaning that the string is later only used as a const char* or so?
If so, in such places, you could use xoj::util::OwnedCString to avoid copying while maintaining mem-safety.

You may also gain a tiny bit by using xoj::util::EnumIndexedArray instead of unordered maps (e.g. in XmlParser::tagTypeToName()). Not sure you'd save a lot but maybe a little.

Ideally, having a benchmark function by function (when a similar function existed before this PR, and with -DCMAKE_BUILD_TYPE=Release) could help finding optimization possibilities.
Maybe you could add a disabled test unit which does some benchmarking, so it can be used again later, when someone else touches the LoadHandler. For a way to do so, you can look at this (unmerged) commit.

Finally: have you looked at the TestLoadHandler test unit? Does it look complete enough to catch any possible issues?

This is a very critical operation, so we must make sure nothing is broken, and we must optimize it quite a bit. I'm a bit scared of the loss of performances.

src/core/control/xojfile/GzInputStream.cpp

src/core/control/xojfile/XmlNames.h

src/core/model/Stroke.cpp

src/core/control/xojfile/ZipInputStream.cpp

src/util/include/util/Util.h

bhennion · 2024-03-20T19:20:29Z

src/core/control/xojfile/XmlParserHelper.h

+std::string decodeBase64(const char* base64data);
+
+// custom types for easier parsing using stream operators
+enum class Domain { ABOSLUTE, ATTACH, CLONE };


Suggested change

enum class Domain { ABOSLUTE, ATTACH, CLONE };

enum class Domain { ABSOLUTE, ATTACH, CLONE };

bhennion · 2024-03-20T19:22:33Z

src/core/control/xojfile/XmlParserHelper.cpp

+auto operator<<(std::ostream& stream, const XmlParserHelper::Domain domain) -> std::ostream& {
+    switch (domain) {
+        case XmlParserHelper::Domain::ABOSLUTE:
+            stream << "absolute";


Here as well, those strings should be constexpr defined in this file.

These strings are also needed in the SaveHandler. But is it worth it to make a whole "AttachmentDomain" class in a separate file ? Where else would it fit ?

tmoerschell · 2024-03-24T22:30:09Z

About strings / C-strings:
I am using xmlTextReaderConstName() and xmlTextReaderConstValue() instead of their counterparts without "Const", so I don't even own them (but they are reset when performing the next read operation). I'll look if there are places where we don't need a copy at all, but that seems unlikely. I was wondering if I should try to use move semantics more, but I don't think it's useful for "small" strings. I might try it out once I have proper benchmarking.

About unordered maps:
Is there any particular reason why EnumIndexedArray's inherited array is not public? If it was, I could write the tag names as a constexpr EnumIndexedArray. Or we would need to at least forward the correct constructor.
For tagNameToType() an unordered map seemed like the most idiomatic way to do it, but strcmp or even a switch case may be faster.

About the test unit:
I have used it for testing, and even found that it had been written to accommodate a small bug from the old LoadHandler (see 24a751f). However I haven't looked at it closely enough to tell if it really tests everything.
For what it's worth, I have also been using the new parser for my everyday usage (I like it risky), and have found no issues.

About benchmarking:
I'll try to whip something up. But generating valid random valid file content seems a little complicated if it's just for that. Is it ok if I just use the existing test files ? It probably would be nice to have larger or more representative samples.

bhennion · 2024-03-25T08:14:54Z

About unordered maps:
Is there any particular reason why EnumIndexedArray's inherited array is not public? If it was, I could write the tag names as a constexpr EnumIndexedArray. Or we would need to at least forward the correct constructor.

We don't want to expose operator[](size_t) for instance. Forwarding the constructor is good, I think.

For tagNameToType() an unordered map seemed like the most idiomatic way to do it, but strcmp or even a switch case may be faster.

Yeah, the set of values is pretty small anyway.

About benchmarking:
I'll try to whip something up. But generating valid random valid file content seems a little complicated if it's just for that. Is it ok if I just use the existing test files ? It probably would be nice to have larger or more representative samples.

Random documents would not even remotely model use cases, and developping them would cost a lot of time. Typical documents are fine. E.g. one very long with only strokes, one with a lot of images, one with a long pdf and some annotations on top, and so on. The important thing is to be able to evaluate each kind of data more or less separately.

Rename XmlNames.h to XmlAttrs.h Move TagType enum to new XmlTags namespace and use it to index the constexpr names Add forwarding constructor to EnumIndexedArray

Add InputStream and derived classes

3275302

tmoerschell force-pushed the xml-parser branch 2 times, most recently from df6f332 to c0c358b Compare March 11, 2024 13:53

tmoerschell added 4 commits March 11, 2024 15:07

Add new XML parser

3e652aa

Use g_ascii_strtod() instead of streams

f0cf8e5

Remove old LoadHandler code and finalize transition

5bd6151

Fix missing alpha values in load handler test

24a751f

tmoerschell force-pushed the xml-parser branch from c0c358b to 24a751f Compare March 11, 2024 14:09

bhennion reviewed Mar 11, 2024

View reviewed changes

Add template specialization declarations

9ac958b

tmoerschell force-pushed the xml-parser branch from b66dd0a to a374154 Compare March 11, 2024 22:10

Add and use constexpr attribute names

9bbad93

tmoerschell force-pushed the xml-parser branch from a374154 to 9bbad93 Compare March 11, 2024 22:30

bhennion reviewed Mar 20, 2024

View reviewed changes

tmoerschell added 3 commits April 7, 2024 21:46

Use more explicit names in 'XmlAttrs'

6c82d7a

Add assertions in InputStream::open()

c070ccd

Only use cxxabi when available

5e089a9

tmoerschell force-pushed the xml-parser branch from de993ad to 5e089a9 Compare April 7, 2024 20:35

Make tag names constexpr

4ce23d6

Rename XmlNames.h to XmlAttrs.h Move TagType enum to new XmlTags namespace and use it to index the constexpr names Add forwarding constructor to EnumIndexedArray

tmoerschell force-pushed the xml-parser branch from 615e32b to e1205fd Compare April 7, 2024 22:32

Use constexpr names for StrokeTool and StrokeCapStyle

6397a1d

tmoerschell force-pushed the xml-parser branch from e1205fd to 6397a1d Compare April 7, 2024 22:34

Use StrokeTool and StrokeCapStyle constexpr names in SaveHandler

f383594

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor LoadHandler #5554

Refactor LoadHandler #5554

tmoerschell commented Mar 11, 2024

bhennion Mar 11, 2024

tmoerschell Mar 11, 2024 •

edited

bhennion commented Mar 11, 2024

tmoerschell commented Mar 11, 2024

bhennion left a comment •

edited

bhennion Mar 20, 2024

bhennion Mar 20, 2024

tmoerschell Apr 7, 2024

tmoerschell commented Mar 24, 2024

bhennion commented Mar 25, 2024 •

edited

	auto optVersion = XmlParserHelper::getAttrib<std::string>("verison", attributeMap);
	auto optVersion = XmlParserHelper::getAttrib<std::string>("version", attributeMap);

	enum class Domain { ABOSLUTE, ATTACH, CLONE };
	enum class Domain { ABSOLUTE, ATTACH, CLONE };

Refactor LoadHandler #5554

Are you sure you want to change the base?

Refactor LoadHandler #5554

Conversation

tmoerschell commented Mar 11, 2024

Commit structure:

Error handling:

Performance:

Other notes:

bhennion Mar 11, 2024

Choose a reason for hiding this comment

tmoerschell Mar 11, 2024 • edited

Choose a reason for hiding this comment

bhennion commented Mar 11, 2024

tmoerschell commented Mar 11, 2024

bhennion left a comment • edited

Choose a reason for hiding this comment

bhennion Mar 20, 2024

Choose a reason for hiding this comment

bhennion Mar 20, 2024

Choose a reason for hiding this comment

tmoerschell Apr 7, 2024

Choose a reason for hiding this comment

tmoerschell commented Mar 24, 2024

bhennion commented Mar 25, 2024 • edited

tmoerschell Mar 11, 2024 •

edited

bhennion left a comment •

edited

bhennion commented Mar 25, 2024 •

edited