
Compression #9

Open
flibustenet opened this issue May 25, 2016 · 6 comments

Comments

@flibustenet

When we use compression before adding a file, it prevents deduplication (the same goes for encryption).
It would be nice to compress data at the storage level instead.
Is this planned?
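
As a rough illustration of the point (a sketch only, not s3git code; it assumes BLAKE2b as the chunk key, which is what s3git uses), two copies of the same payload compressed with different settings no longer hash to the same chunk, while the raw payload always does:

```go
// Sketch: why compressing before adding defeats content-addressed dedup.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"

	"golang.org/x/crypto/blake2b"
)

// gzipped compresses data at the given level (hypothetical helper).
func gzipped(data []byte, level int) []byte {
	var buf bytes.Buffer
	w, _ := gzip.NewWriterLevel(&buf, level)
	w.Write(data)
	w.Close()
	return buf.Bytes()
}

func main() {
	payload := bytes.Repeat([]byte("the same content added twice "), 1000)

	// Raw chunks of identical content always produce identical keys.
	rawA, rawB := blake2b.Sum512(payload), blake2b.Sum512(payload)
	fmt.Println("raw chunks dedup:", rawA == rawB) // true

	// The same content compressed with different settings yields
	// different byte streams, so the store sees unrelated chunks.
	cmpA := blake2b.Sum512(gzipped(payload, gzip.BestSpeed))
	cmpB := blake2b.Sum512(gzipped(payload, gzip.BestCompression))
	fmt.Println("pre-compressed chunks dedup:", cmpA == cmpB) // false
}
```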

@fwessels
Collaborator

That is a good point; there are no plans at the moment to add compression after deduping the data into chunks.
I guess that for some content, such as text or ASCII data, it can be quite useful, but for e.g. video it would not give any meaningful reduction in size and would just add processing overhead.
Unfortunately, at the chunk level there is no easy way to tell whether a chunk (primarily) contains text or binary data, so you can't have the best of both worlds...

@flibustenet
Author

Maybe the work of @klauspost can help?
https://github.com/klauspost/compress

@klauspost

@flibustenet - by request of the Docker devs, I have made sure that gzip is deterministic for the same input with the same code version and compression level.

@fwessels - I have added functionality to my package that will make it skip incompressible content at ~200MB/s. deflate/gzip adds ~0.01% size to incompressible content, so there isn't a huge risk in enabling it.
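
As a minimal sketch of what the determinism guarantee means in practice (using github.com/klauspost/compress/gzip as a drop-in for the standard library; the compress helper below is hypothetical): the same input, compression level and library version should produce byte-identical output, so compressed chunks can still be compared by hash.

```go
// Sketch: deterministic gzip output for identical input and level.
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/compress/gzip"
)

// compress gzips data at the given level (hypothetical helper).
func compress(data []byte, level int) []byte {
	var buf bytes.Buffer
	w, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		panic(err)
	}
	if _, err := w.Write(data); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	chunk := bytes.Repeat([]byte("deduplicated chunk data "), 4096)

	a := compress(chunk, gzip.BestSpeed)
	b := compress(chunk, gzip.BestSpeed)
	fmt.Println("identical output for identical input/level:", bytes.Equal(a, b))
}
```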

@fwessels
Collaborator

@klauspost tak (thanks)!

Do you mean you store the content as 'stored', as in (I believe) the mode '0' of the zip format?
And you detect whether or not to compress based on compressing the first part of the chunk/stream?

BTW Great work on the library!

@klauspost

> Do you mean you store the content as 'stored', as in (I believe) the mode '0' of the zip format?

No, it will still be stored as "deflated", but deflate splits the input into blocks, typically up to 64k each. Each block has 3 modes: uncompressed, static Huffman (mostly useless) or dynamic Huffman compressed. If a block is deemed incompressible it will just be stored as literal bytes. The overhead is the block header plus the block size field.

Usually it is pretty slow to detect whether content is compressible or not, but with an algorithm from LZ4 it will skip incompressible data quite fast. Therefore you do not pay the compression cost on incompressible data. The "incompressible" detection resets at regular intervals - it varies with compression level, but is typically every 64KB.
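
A rough sketch of that stored-block fallback (using github.com/klauspost/compress/flate; the exact overhead will vary a little): feeding random, incompressible data through deflate should only grow it by a fraction of a percent, since incompressible blocks are emitted as literal bytes.

```go
// Sketch: deflating incompressible (random) input only adds block
// header overhead instead of expanding the data.
package main

import (
	"bytes"
	"crypto/rand"
	"fmt"

	"github.com/klauspost/compress/flate"
)

func main() {
	// 1 MiB of random data stands in for an already-compressed chunk
	// (video, encrypted data, etc.).
	incompressible := make([]byte, 1<<20)
	if _, err := rand.Read(incompressible); err != nil {
		panic(err)
	}

	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		panic(err)
	}
	w.Write(incompressible)
	w.Close()

	overhead := float64(buf.Len()-len(incompressible)) / float64(len(incompressible)) * 100
	fmt.Printf("in: %d bytes, out: %d bytes, overhead: %.3f%%\n",
		len(incompressible), buf.Len(), overhead)
}
```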

@fwessels
Collaborator

@klauspost Thanks for the clear explanation, I will give it a try.

One thing though is that essentially two modes are supported when pushing into the cloud, namely deduped and hydrated (see https://github.com/s3git/s3git/blob/master/BLAKE2.md#cloud-storage). For the hydrated mode you'd have to decompress the chunks again, which is of course not an issue...
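
A minimal sketch of that hydration step, assuming chunks were gzip-compressed at storage level (the hydrate helper is hypothetical, not part of s3git):

```go
// Sketch: hydrating a compressed chunk is just transparent decompression.
package main

import (
	"bytes"
	"fmt"
	"io"

	"github.com/klauspost/compress/gzip"
)

// hydrate decompresses a stored chunk back to its original bytes.
func hydrate(stored []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(stored))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// In practice the compressed chunk would come from the object store.
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write([]byte("chunk contents"))
	w.Close()

	plain, err := hydrate(buf.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(plain))
}
```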

Currently the chunks are of fixed (configurable) length. One thing that I want to add is a rolling hash mechanism like the one used in https://github.com/restic/chunker, which will result in variable-length chunks -- any thoughts on how to best integrate this as well?
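
For reference, a small sketch of how github.com/restic/chunker's rolling-hash (content-defined) chunking could sit in front of the chunk hashing; the polynomial handling and the BLAKE2 keying here are assumptions for illustration, not s3git's actual integration:

```go
// Sketch: content-defined chunking with a rolling Rabin hash, so an
// edit near the start of a file only changes nearby chunks instead of
// shifting every fixed-size chunk after it.
package main

import (
	"bytes"
	"fmt"
	"io"

	"github.com/restic/chunker"
	"golang.org/x/crypto/blake2b"
)

func main() {
	// In practice the polynomial would be fixed per repository so that
	// identical content always chunks the same way.
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		panic(err)
	}

	data := bytes.Repeat([]byte("some repository content "), 1<<20)
	c := chunker.New(bytes.NewReader(data), pol)

	buf := make([]byte, chunker.MaxSize)
	for {
		chunk, err := c.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// Key each variable-length chunk by its BLAKE2b hash.
		sum := blake2b.Sum512(chunk.Data)
		fmt.Printf("chunk at %8d, length %7d, key %x...\n",
			chunk.Start, chunk.Length, sum[:8])
	}
}
```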
