Compression #9
Comments
That is a good point; there are no plans at the moment to add compression after deduping the data into chunks.
Maybe the work of @klauspost can help?
@flibustenet - at the request of the Docker devs, I have made sure that gzip output is deterministic for the same input, code version and compression level.

@fwessels - I have added functionality to my package that makes it skip incompressible content at ~200 MB/s. Deflate/gzip adds only ~0.01% to the size of incompressible content, so there isn't a huge risk in enabling it.
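For illustration, a minimal sketch of compressing a single deduplicated chunk with the klauspost/compress package, assuming it is used as a drop-in replacement for the standard library compress/gzip (same API); the chunk data and compression level here are placeholders:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/compress/gzip" // drop-in for compress/gzip
)

// compressChunk gzips a single chunk at a fixed level. Keeping the level
// and the library version fixed is what keeps the output deterministic
// for identical input.
func compressChunk(chunk []byte, level int) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		return nil, err
	}
	if _, err := zw.Write(chunk); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	chunk := bytes.Repeat([]byte("some repetitive chunk data "), 1000)
	out, err := compressChunk(chunk, gzip.BestSpeed)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("in: %d bytes, out: %d bytes\n", len(chunk), len(out))
}
```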
@klauspost thanks! Do you mean you store the content as 'stored', as in (I believe) mode '0' of the zip format? BTW, great work on the library!
No, it will still be stored as "deflated", but deflate splits the input into blocks, typically up to 64k each. Each block has 3 modes: uncompressed, static Huffman (mostly useless) or dynamic Huffman compressed. If a block is deemed incompressible, it is simply stored as literal bytes; the overhead is the block header + block size. Detecting whether content is compressible is usually pretty slow, but with an algorithm from LZ4 it skips incompressible data quite fast, so you do not pay the compression cost on incompressible data. The "incompressible" detection resets at regular intervals - it varies with compression level, but is typically every 64KB.
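As a rough way to see that overhead in practice, the standard library flate (klauspost's package could be substituted) can be run over random, incompressible input and the size difference measured:

```go
package main

import (
	"bytes"
	"compress/flate"
	"crypto/rand"
	"fmt"
	"log"
)

func main() {
	// Random bytes are effectively incompressible, so deflate should fall
	// back to stored (literal) blocks with only a small per-block overhead.
	data := make([]byte, 1<<20) // 1 MiB
	if _, err := rand.Read(data); err != nil {
		log.Fatal(err)
	}

	var buf bytes.Buffer
	fw, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := fw.Write(data); err != nil {
		log.Fatal(err)
	}
	if err := fw.Close(); err != nil {
		log.Fatal(err)
	}

	overhead := buf.Len() - len(data)
	fmt.Printf("in: %d, out: %d, overhead: %d bytes (%.4f%%)\n",
		len(data), buf.Len(), overhead, 100*float64(overhead)/float64(len(data)))
}
```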
@klauspost Thanks for the clear explanation, I will give it a try. One thing though is that essentially two modes are supported when pushing into the cloud, namely deduped and hydrated (see https://github.com/s3git/s3git/blob/master/BLAKE2.md#cloud-storage). For the hydrated mode you'd have to decompress the chunks again, which is of course not an issue... Currently the chunks are of fixed (configurable) length; one thing that I want to add is a rolling hash mechanism like the one used in https://github.com/restic/chunker, which will result in variable-length chunks -- any thoughts on how to best integrate this as well?
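A rough sketch, not s3git code, of how restic/chunker's rolling hash could produce variable-length chunks that are then keyed on their uncompressed content; BLAKE2b-256 stands in here for s3git's actual BLAKE2 key derivation, and the random polynomial would in practice have to be derived once and stored:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"

	"github.com/restic/chunker"
	"golang.org/x/crypto/blake2b"
)

func main() {
	// Some input larger than the chunker's minimum chunk size.
	input := bytes.Repeat([]byte("example data for content-defined chunking "), 100000)

	// In a real repository the polynomial must be fixed, otherwise chunk
	// boundaries - and thus deduplication - differ between runs.
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		log.Fatal(err)
	}

	ck := chunker.New(bytes.NewReader(input), pol)
	buf := make([]byte, chunker.MaxSize)

	for {
		c, err := ck.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// Key the *uncompressed* chunk so identical content always maps to
		// the same key; compression can then happen independently at the
		// storage layer without affecting deduplication.
		key := blake2b.Sum256(c.Data)
		fmt.Printf("chunk at %d, length %d, key %x...\n", c.Start, c.Length, key[:8])
	}
}
```

Because the key is computed on the uncompressed chunk, a per-chunk compression step at the storage layer would not change which chunks dedupe, and the hydrated mode would simply skip (or reverse) that step.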
When we compress a file before adding it, this prevents deduplication (the same goes for encryption).
It would be fine to compress the data at the storage level instead.
Is this planned?
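For illustration, a small sketch (not s3git code) of why compressing before adding defeats deduplication: the compressed bytes become the unit of identity, and even gzip header metadata such as the modification time changes them, so identical content no longer maps to the same key:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
	"time"
)

// gzipWith compresses data with a given level and header modification time.
// gzip embeds the mtime in its header, so the same content compressed at a
// different time (or level) produces different bytes.
// Errors are ignored for brevity in this sketch.
func gzipWith(data []byte, level int, mtime time.Time) []byte {
	var buf bytes.Buffer
	zw, _ := gzip.NewWriterLevel(&buf, level)
	zw.ModTime = mtime
	zw.Write(data)
	zw.Close()
	return buf.Bytes()
}

func main() {
	content := bytes.Repeat([]byte("the same logical content "), 10000)

	// Keying the raw content always yields the same key: dedup works.
	fmt.Printf("raw key:        %x\n", sha256.Sum256(content))
	fmt.Printf("raw key again:  %x\n", sha256.Sum256(content))

	// Compressing before adding: header metadata and level leak into the
	// stored bytes, so identical content gets different keys.
	t1 := time.Date(2016, 1, 1, 0, 0, 0, 0, time.UTC)
	t2 := time.Date(2016, 1, 2, 0, 0, 0, 0, time.UTC)
	fmt.Printf("gzip key (t1):  %x\n", sha256.Sum256(gzipWith(content, gzip.BestSpeed, t1)))
	fmt.Printf("gzip key (t2):  %x\n", sha256.Sum256(gzipWith(content, gzip.BestSpeed, t2)))
}
```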