Binary Reproduceability #3484

benjaminwinger · 2024-05-13T20:17:07Z

In addition to testing databases created on one platform on different platforms, it would be useful to do direct binary comparisons between the databases (when produced in single-threaded mode). That would help cover gaps in the coverage of the binary database tests, and would also enforce that Kuzu's output is deterministic and that the data is being properly initialized.

It could be implemented by creating a new database alongside the binary database tests and comparing the two with something like diff -r from GNU diffutils. An older version of diffutils is available for Windows (also via Chocolatey).
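
As an alternative to shelling out to diff -r, the comparison could also be done inside the test itself. The following is only a minimal sketch using std::filesystem (C++17); the function names filesEqual, listFiles and directoriesEqual are made up for illustration and are not part of Kuzu.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <set>
#include <string>

namespace fs = std::filesystem;

// Returns true if both files contain exactly the same bytes.
bool filesEqual(const fs::path& a, const fs::path& b) {
    if (fs::file_size(a) != fs::file_size(b)) {
        return false;
    }
    std::ifstream fa(a, std::ios::binary);
    std::ifstream fb(b, std::ios::binary);
    return std::equal(std::istreambuf_iterator<char>(fa), std::istreambuf_iterator<char>(),
        std::istreambuf_iterator<char>(fb), std::istreambuf_iterator<char>());
}

// Collects the paths of all regular files under root, relative to root.
std::set<std::string> listFiles(const fs::path& root) {
    std::set<std::string> files;
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_regular_file()) {
            files.insert(fs::relative(entry.path(), root).generic_string());
        }
    }
    return files;
}

// Rough equivalent of `diff -r a b` that only reports equal / not equal.
bool directoriesEqual(const fs::path& a, const fs::path& b) {
    assert(fs::is_directory(a) && fs::is_directory(b));
    const auto filesA = listFiles(a);
    if (filesA != listFiles(b)) {
        return false; // a file exists on one side only
    }
    for (const auto& rel : filesA) {
        if (!filesEqual(a / rel, b / rel)) {
            return false;
        }
    }
    return true;
}
```

Unlike diff -r, this does not say which file differs or where; a real test would probably want to report the first mismatching path and offset.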

mewim · 2024-05-13T20:19:09Z

If the databases are created in different environments, are the binaries directly comparable? Could the result depend on parallelism, etc.?

benjaminwinger · 2024-05-13T20:27:32Z

It can depend on parallelism when copying, but if we always create the databases in single-threaded mode when doing such comparisons, they should be directly comparable.
Apart from endianness (in which case the databases won't be compatible at all), the only difference should be the order in which the input data is processed when copying, and that should be consistent if we do it single-threaded, or at least it should be possible to make it so. Maybe it turns out that we're doing something inherently non-deterministic, but the databases seemed to be almost identical between the different platforms when I was debugging kuzudb/explorer#129, and it shouldn't be very complicated to set up, so I think we should see if it works.

benjaminwinger · 2024-05-15T16:24:52Z

As noted in #3501, the catalog stores and serializes catalog entries in the iteration order of a std::unordered_map, which is not consistent across standard library implementations. If we want to make them consistent, we could either use an ordered map, or sort the entries by name before serializing them.
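
The second option (sorting before serializing) could look roughly like the sketch below. Entry and serializeCatalog are hypothetical stand-ins, not Kuzu's actual catalog types or serializer, and the string output format is invented purely to make the ordering visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a catalog entry.
struct Entry {
    std::string name;
    uint64_t id;
};

// Serializes entries in name order, so the output does not depend on the
// iteration order of the unordered_map (which varies between stdlibs).
std::string serializeCatalog(const std::unordered_map<std::string, Entry>& entries) {
    std::vector<const Entry*> sorted;
    sorted.reserve(entries.size());
    for (const auto& [key, entry] : entries) {
        assert(key == entry.name); // assumed invariant: the map is keyed by entry name
        sorted.push_back(&entry);
    }
    std::sort(sorted.begin(), sorted.end(),
        [](const Entry* a, const Entry* b) { return a->name < b->name; });

    std::string out;
    for (const Entry* entry : sorted) {
        out += entry->name;
        out += '=';
        out += std::to_string(entry->id);
        out += ';';
    }
    return out;
}
```

Sorting at serialization time keeps O(1) average lookups in the in-memory map; switching to std::map would give the same ordering without the extra sort, at the cost of O(log n) lookups.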

ray6080 · 2024-05-15T16:44:33Z

> If we want to make them consistent, we could either use an ordered map, or sort the entries by name before serializing them.

Serializing catalog entries in a fixed order makes more sense to me.

benjaminwinger · 2024-05-29T17:53:29Z

I've found that one way of ensuring that directly writing a struct to disk works consistently is to assert that it has a unique object representation via std::has_unique_object_representations_v. Among other things it ensures there is no padding (which isn't guaranteed to be zeroed and particularly in release mode will usually be uninitialized).
Unfortunately it doesn't allow floats/doubles because their spec allows for multiple representations of equal values, which limits its usefulness. I suspect the float/double ambiguity is generally fine though, as I doubt that we will encounter issues with, for example, zero being stored as negative zero on a different platform and breaking binary equality.
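
A minimal sketch of that check follows. The header structs and the writeRaw helper are hypothetical and only illustrate the idea; they are not the types used in Kuzu. It assumes a typical ABI where uint64_t has stricter alignment than uint8_t, so PaddedHeader contains padding bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical header with padding between `flag` and `numElements`.
struct PaddedHeader {
    uint8_t flag;
    uint64_t numElements;
};

// Hypothetical header laid out so that no padding is needed.
struct PackedHeader {
    uint64_t numElements;
    uint32_t pageIdx;
    uint32_t flag;
};

// Appends the raw bytes of `value` to `out`. Compilation fails for types
// whose bytes are not fully determined by their value (e.g. padded structs).
template<typename T>
void writeRaw(std::vector<uint8_t>& out, const T& value) {
    static_assert(std::is_trivially_copyable_v<T>);
    static_assert(std::has_unique_object_representations_v<T>,
        "type may contain padding or other bytes that are not part of its value");
    const auto* bytes = reinterpret_cast<const uint8_t*>(&value);
    out.insert(out.end(), bytes, bytes + sizeof(T));
}

static_assert(std::has_unique_object_representations_v<PackedHeader>);
// writeRaw(out, PaddedHeader{}) would be rejected at compile time.
// A struct with a float or double member is also rejected by the major
// compilers, which is the limitation mentioned above.
```

With this in place, two PackedHeader objects holding equal field values always produce identical bytes on disk, regardless of build mode.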

benjaminwinger mentioned this issue May 15, 2024

Make serializer tool able to be run standalone #3501

Merged

benjaminwinger mentioned this issue May 29, 2024

Disk array packed headers #3557

Merged
