
Compression #9

Open
flibustenet opened this issue May 25, 2016 · 6 comments

Comments

@flibustenet

When we use compression before adding a file, it prevents deduplication (the same goes for encryption).
It would be nice to compress data at the storage level instead.
Is this planned?
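
As a rough illustration of the point (a sketch only, not s3git code; it assumes BLAKE2b as the chunk key, which is what s3git uses), two copies of the same payload compressed with different settings no longer hash to the same chunk, while the raw payload always does:

```go
// Sketch: why compressing before adding defeats content-addressed dedup.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"

	"golang.org/x/crypto/blake2b"
)

// gzipped compresses data at the given level (hypothetical helper).
func gzipped(data []byte, level int) []byte {
	var buf bytes.Buffer
	w, _ := gzip.NewWriterLevel(&buf, level)
	w.Write(data)
	w.Close()
	return buf.Bytes()
}

func main() {
	payload := bytes.Repeat([]byte("the same content added twice "), 1000)

	// Raw chunks of identical content always produce identical keys.
	rawA, rawB := blake2b.Sum512(payload), blake2b.Sum512(payload)
	fmt.Println("raw chunks dedup:", rawA == rawB) // true

	// The same content compressed with different settings yields
	// different byte streams, so the store sees unrelated chunks.
	cmpA := blake2b.Sum512(gzipped(payload, gzip.BestSpeed))
	cmpB := blake2b.Sum512(gzipped(payload, gzip.BestCompression))
	fmt.Println("pre-compressed chunks dedup:", cmpA == cmpB) // false
}
```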

@fwessels
Collaborator

That is a good point; there are no plans at the moment to add compression after deduping the data into chunks.
I guess that for some content, such as text or ASCII data, it can be quite useful, but for e.g. video it would not give any meaningful reduction in size and would just add processing overhead.
Unfortunately, at the chunk level there is no easy way to tell whether a chunk (primarily) contains text or binary data, so you can't have the best of both worlds...

@flibustenet
Author

Maybe the work of @klauspost can help?
https://github.com/klauspost/compress

@klauspost

@flibustenet - by request of the Docker devs, I have made sure that gzip is deterministic for the same input with the same code version and compression level.

@fwessels - I have added functionality to my package that will make it skip incompressible content at ~200MB/s. deflate/gzip adds ~0.01% size to incompressible content, so there isn't a huge risk in enabling it.
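
As a minimal sketch of what the determinism guarantee means in practice (using github.com/klauspost/compress/gzip as a drop-in for the standard library; the compress helper below is hypothetical): the same input, compression level and library version should produce byte-identical output, so compressed chunks can still be compared by hash.

```go
// Sketch: deterministic gzip output for identical input and level.
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/compress/gzip"
)

// compress gzips data at the given level (hypothetical helper).
func compress(data []byte, level int) []byte {
	var buf bytes.Buffer
	w, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		panic(err)
	}
	if _, err := w.Write(data); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	chunk := bytes.Repeat([]byte("deduplicated chunk data "), 4096)

	a := compress(chunk, gzip.BestSpeed)
	b := compress(chunk, gzip.BestSpeed)
	fmt.Println("identical output for identical input/level:", bytes.Equal(a, b))
}
```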

@fwessels
Collaborator

@klauspost tak (thanks)!

Do you mean you store the content as 'stored', as in (I believe) the mode '0' of the zip format?
And you detect whether or not to compress based on compressing the first part of the chunk/stream?

BTW Great work on the library!

@klauspost

> Do you mean you store the content as 'stored', as in (I believe) the mode '0' of the zip format?

No, it will still be stored as "deflated", but deflate splits the input into blocks, typically up to 64k each. Each block has 3 modes: uncompressed, static Huffman (mostly useless) or dynamic Huffman compressed. If a block is deemed incompressible it will just be stored as literal bytes. The overhead is the block header plus the block size field.

Usually it is pretty slow to detect whether content is compressible or not, but with an algorithm from LZ4 it will skip incompressible data quite fast. Therefore you do not pay the compression cost on incompressible data. The "incompressible" detection resets at regular intervals - it varies with compression level, but is typically every 64KB.
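
A rough sketch of that stored-block fallback (using github.com/klauspost/compress/flate; the exact overhead will vary a little): feeding random, incompressible data through deflate should only grow it by a fraction of a percent, since incompressible blocks are emitted as literal bytes.

```go
// Sketch: deflating incompressible (random) input only adds block
// header overhead instead of expanding the data.
package main

import (
	"bytes"
	"crypto/rand"
	"fmt"

	"github.com/klauspost/compress/flate"
)

func main() {
	// 1 MiB of random data stands in for an already-compressed chunk
	// (video, encrypted data, etc.).
	incompressible := make([]byte, 1<<20)
	if _, err := rand.Read(incompressible); err != nil {
		panic(err)
	}

	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		panic(err)
	}
	w.Write(incompressible)
	w.Close()

	overhead := float64(buf.Len()-len(incompressible)) / float64(len(incompressible)) * 100
	fmt.Printf("in: %d bytes, out: %d bytes, overhead: %.3f%%\n",
		len(incompressible), buf.Len(), overhead)
}
```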

@fwessels
Collaborator

@klauspost Thanks for the clear explanation, I will give it a try.

One thing though is that essentially two modes are supported when pushing into the cloud, namely deduped and hydrated (see https://github.com/s3git/s3git/blob/master/BLAKE2.md#cloud-storage). For the hydrated mode you'd have to decompress the chunks again, which is of course not an issue...
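
A minimal sketch of that hydration step, assuming chunks were gzip-compressed at storage level (the hydrate helper is hypothetical, not part of s3git):

```go
// Sketch: hydrating a compressed chunk is just transparent decompression.
package main

import (
	"bytes"
	"fmt"
	"io"

	"github.com/klauspost/compress/gzip"
)

// hydrate decompresses a stored chunk back to its original bytes.
func hydrate(stored []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(stored))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// In practice the compressed chunk would come from the object store.
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write([]byte("chunk contents"))
	w.Close()

	plain, err := hydrate(buf.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(plain))
}
```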

Currently the chunks are of fixed (configurable) length. One thing that I want to add is a rolling hash mechanism like the one used in https://github.com/restic/chunker, which will result in variable-length chunks -- any thoughts on how to best integrate this as well?
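
For reference, a small sketch of how github.com/restic/chunker's rolling-hash (content-defined) chunking could sit in front of the chunk hashing; the polynomial handling and the BLAKE2 keying here are assumptions for illustration, not s3git's actual integration:

```go
// Sketch: content-defined chunking with a rolling Rabin hash, so an
// edit near the start of a file only changes nearby chunks instead of
// shifting every fixed-size chunk after it.
package main

import (
	"bytes"
	"fmt"
	"io"

	"github.com/restic/chunker"
	"golang.org/x/crypto/blake2b"
)

func main() {
	// In practice the polynomial would be fixed per repository so that
	// identical content always chunks the same way.
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		panic(err)
	}

	data := bytes.Repeat([]byte("some repository content "), 1<<20)
	c := chunker.New(bytes.NewReader(data), pol)

	buf := make([]byte, chunker.MaxSize)
	for {
		chunk, err := c.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// Key each variable-length chunk by its BLAKE2b hash.
		sum := blake2b.Sum512(chunk.Data)
		fmt.Printf("chunk at %8d, length %7d, key %x...\n",
			chunk.Start, chunk.Length, sum[:8])
	}
}
```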
