
Overallocation of arrays now seems to persist through precompile serialization on 1.11+ #54409

Open · KristofferC opened this issue May 8, 2024 · 9 comments
Labels: compiler:precompilation (Precompilation of modules), kind:regression (Regression in behavior compared to a previous version), kind:release (Release management and versioning)

@KristofferC (Member)

Creating the following package:

module OverAlloc

function make_arr()
    x = Float64[]
    for i in 1:175
        push!(x, 0.0)
    end
    return x
end

const myarr = [make_arr() for x in 1:100000]

greet() = print("Hello World!")

end # module OverAlloc

results in a pkgimage that is 140 MB on 1.10 but 554 MB on master. Might be related to #53570.

@KristofferC added the kind:regression label May 8, 2024
@KristofferC (Member, Author)

The reason for this (at least as it has been explained to me) is that, since the array implementation moved to Julia, an overallocated Array is now just a wrapper around a big Memory, and in theory that Memory could be shared with other objects, making it unsafe to automatically shrink it during serialization.
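For illustration, a minimal sketch on Julia 1.11 showing the spare capacity push! leaves behind (the .ref.mem field path is an internal implementation detail and may change):

x = Float64[]
for i in 1:175
    push!(x, 0.0)
end

length(x)          # 175
length(x.ref.mem)  # > 175: push! grows the backing Memory geometrically,
                   # leaving unused capacity at the end of the buffer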

@StefanKarpinski (Member)

We generally try to preserve the entire object graph structure when serializing and deserializing, but in this case we may need to do something else. We could save each Array with its own shrunk copy of the Memory that it wraps. The downside of that is that if you're serializing a bunch of arrays that share memory, you'll bloat the serialized data and, perhaps more importantly, when deserializing they will no longer be "connected" and changes to one will not appear in the other. Perhaps that's ok? What guarantees do we make about mutation and sharing with arrays? Another possibility would be to defer the serialization of Memory buffers until later and keep track of how much of each is actually used. But that feels a bit too fancy to me.
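A minimal sketch of the "shrunk copy" idea (shrunk_memory is an illustrative name, not an actual API):

function shrunk_memory(x::Vector{T}) where {T}
    mem = Memory{T}(undef, length(x))  # exactly-sized buffer
    copyto!(mem, 1, x, 1, length(x))   # copy only the used extent,
                                       # ignoring any spare capacity
    return mem
end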

@StefanKarpinski (Member)

If you have two arrays referencing the same memory but using different extents of it, is it necessarily the case that one has been resized, such that they could have been detached from each other?

@topolarity (Member) commented May 8, 2024

The only ways that I'm aware of to create two directly aliasing Arrays (and not just e.g. a view) are:

  1. y = reshape(x, ...)
  2. y = unsafe_wrap(typeof(x), pointer(x), size(x))
  3. y = wrap(Array, x.ref.mem, size(x))

I'm not sure what our behavior for (1) in pkgimages is right now, but (2) already does not survive pkgimage serialization, and (3) did not exist until 1.11.
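For concreteness, a sketch of (1) and (2) on Julia 1.11 ((3)'s wrap was still experimental at the time, so it is omitted; the unsafe_wrap line is for illustration only):

x = zeros(Float64, 4)

y1 = reshape(x, 2, 2)                         # (1): shares x's Memory
y1[1, 1] = 1.0
x[1] == 1.0                                   # true: writes are visible through both

y2 = unsafe_wrap(Array, pointer(x), size(x))  # (2): aliases via a raw pointer
y2[2] = 2.0
x[2] == 2.0                                   # true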

@vtjnash (Member) commented May 8, 2024

unsafe_wrap is already UB (undefined behavior) if the pointer has a Julia-accessible reference anywhere else, so that case does not matter. But there are equivalent calls, for reshape or for wrapping from a Memory, which do not need to call any unsafe functions.

@StefanKarpinski (Member)

On 1.10 (and 1.11) aliasing of array memory does not survive serialization and deserialization:

julia> a = [1]
1-element Vector{Int64}:
 1

julia> b = reshape(a, 1, 1)
1×1 Matrix{Int64}:
 1

julia> using Serialization

julia> tmp = tempname();

julia> open(tmp, write=true) do io
           serialize(io, a)
           serialize(io, b)
       end
8

julia> a′, b′ = open(tmp, read=true) do io
           deserialize(io),
           deserialize(io)
       end
([1], [1;;])

julia> b[1] = 2
2

julia> a
1-element Vector{Int64}:
 2

julia> b′[1] = 2
2

julia> a′
1-element Vector{Int64}:
 1

So we're totally free to trim the data here since each array object is serialized with its own copy of the memory. In fact, it would be technically breaking to share memory between different serialized arrays.

@vtjnash (Member) commented May 8, 2024

That is a different, and unrelated, serialization protocol (the Serialization stdlib) from the one at issue in the original problem, which is pkgimage serialization.

@KristofferC added the compiler:precompilation label May 9, 2024
@KristofferC changed the title from "Overallocation of arrays now seems to persist through serialization on 1.11+" to "Overallocation of arrays now seems to persist through precompile serialization on 1.11+" May 9, 2024
@KristofferC added the kind:release label May 9, 2024
@martinholters (Member)

If we zero-initialized all memory, we could exploit that by efficiently representing Memory with trailing zeros during serialization...

@StefanKarpinski (Member)

Oh, that's a clever idea! In general, I have found that zero-run-length encoding is a shockingly effective simple compression strategy. So much so that I even wrote a simple C tool for zrl encoding and decoding of data: https://github.com/StefanKarpinski/zrl.

Aside: Why did I write this tool? It's a somewhat involved story. At one point, @staticfloat and I were considering serving binary diffs of packages and artifacts, generated with the bsdiff tool, through the pkg servers, so that people don't need to redownload everything when only a little bit has changed. We might still do it if we ever get the bandwidth to implement it, but it adds significant complexity to the pkg servers, so it's not a clear win.

When I was testing that and trying to optimize both the diff size and the speed of applying a diff to generate the new data, I found that the single most effective optimization is to zero-run-length encode the old and new data before generating the diff. Why? Because there tend to be a lot of sequences of zeros in files, especially binaries and tarballs, and the bsdiff algorithm does not do very well when there are a lot of different sequences that are suffixes of each other, like all the different lengths of zero sequences.

By zrl encoding the data, you both make it drastically shorter in a very cheap way, and you make those sequences of zeros no longer subsequences of each other: they are all encoded as a single zero byte followed by an LEB128 integer. This makes the diffs much smaller and much, much faster to apply. Maybe someday we'll actually implement diff serving...
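For reference, a minimal sketch of that encoding in Julia (zrl_encode is an illustrative reimplementation of the scheme described above, not the actual zrl tool): every run of zero bytes becomes a single 0x00 marker followed by the run length as an LEB128 integer, while non-zero bytes pass through unchanged.

function zrl_encode(data::Vector{UInt8})
    out = UInt8[]
    i = 1
    while i <= length(data)
        if data[i] == 0x00
            j = i
            while j <= length(data) && data[j] == 0x00
                j += 1                    # measure the run of zeros
            end
            n = j - i
            push!(out, 0x00)              # marker byte for a zero run
            while true                    # emit run length as LEB128:
                byte = UInt8(n & 0x7f)    # 7 bits per output byte,
                n >>= 7                   # high bit marks continuation
                push!(out, n == 0 ? byte : byte | 0x80)
                n == 0 && break
            end
            i = j
        else
            push!(out, data[i])           # non-zero bytes pass through
            i += 1
        end
    end
    return out
end

zrl_encode(UInt8[1, 0, 0, 0, 2])  # -> UInt8[0x01, 0x00, 0x03, 0x02]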
