Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary Reproduceability #3484

Open
benjaminwinger opened this issue May 13, 2024 · 5 comments
Open

Binary Reproduceability #3484

benjaminwinger opened this issue May 13, 2024 · 5 comments

Comments

@benjaminwinger
Copy link
Collaborator

In addition to testing databases created on one platform on different platforms, it would be useful to do direct binary comparisons between the databases (when produced in single threaded mode). That will help cover gaps in the coverage of the binary database tests, and can also enforce that kuzu's output is deterministic and the data is being properly initialized.

It could be implemented by creating a new database alongside the binary database tests, and comparing them with something like diff -r from GNU diffutils. An older version of diffutils is available for windows here (or via chocolatey).

@mewim
Copy link
Collaborator

mewim commented May 13, 2024

If the database is created under different environment, is the binary directly comparable? Could it depend on parallelism, etc?

@benjaminwinger
Copy link
Collaborator Author

It can depend on parallelism when copying, but if we always create databases in single-threaded mode when doing such comparisons it should be able to be directly comparable.
The only differences (except for endianness, but in that case the databases won't be compatible at all) should be the order that the input data is processed when copying, but if we do it single-threaded that should be consistent. Or should be able to be at least. Maybe it turns out that we're doing something inherently non-deterministic, but the databases seemed to be almost identical between the different platforms when I was debugging kuzudb/explorer#129, and it shouldn't be very complicated to set up so I think we should see if it works.

@benjaminwinger
Copy link
Collaborator Author

As noted in #3501, the catalog stores and serializes catalog entries using the order produced by an std::unordered_map, which is not consistent across stdlibs. If we want to make them consistent, we could either use an ordered map, or sort the entries by name before serializing them.

@ray6080
Copy link
Contributor

ray6080 commented May 15, 2024

If we want to make them consistent, we could either use an ordered map, or sort the entries by name before serializing them.

Serialize catalog entries in a certain order makes more sense to me.

@benjaminwinger
Copy link
Collaborator Author

I've found that one way of ensuring that directly writing a struct to disk works consistently is to assert that it has a unique object representation via std::has_unique_object_representations_v. Among other things it ensures there is no padding (which isn't guaranteed to be zeroed and particularly in release mode will usually be uninitialized).
Unfortunately it doesn't allow floats/doubles because their spec allows for multiple representations of equal values, which limits its usefulness. I suspect the float/double ambiguity is generally fine though, as I doubt that we will encounter issues with, for example, zero being stored as negative zero on a different platform and breaking binary equality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants