s3git snapshots read entire (potentially huge) data files into memory with probability 1 in 64 #21

vsivsi opened this issue Mar 10, 2017 · 13 comments

vsivsi commented Mar 10, 2017

When used to commit snapshots, s3git checks whether each file is stored in the snapshot deduped format, as described in the s3git documentation.

The code that performs this check for each file in the snapshot implements a quick "prefilter" using the length of each file:

// Check whether the size contains at least two keys and is a multiple of 64 bytes
if stat.Size() < KeySize*2 || stat.Size() & (KeySize-1) != 0 {
	return false, "", nil, nil
}

Any file that passes this prefilter (roughly a 1 in 64 chance for an arbitrary file) is then read entirely into memory in an attempt to verify that the presumed root hash (the last 64 bytes of the file) is the hash of the concatenation of all of the bytes that precede it.

Although it may seem safe to read a valid deduped file into memory like this (an 8TB file will only produce a ~100MB deduped file), the problem is that the very purpose of this check is to determine whether this is in fact a deduped file. If it is not, it may be a huge file (e.g. an exactly 8TB file) that nonetheless meets the criteria above of len >= 128 && (len & 63) == 0.

This is obviously bad, and probably two things need to change to fix it.

  1. For safety and stability, the memory used by this process needs to be limited. The use of ioutil.ReadFile(filename) (usually a bad code smell) needs to be replaced with functionality that reads the file in chunks, streaming them into the hash calculation, so that memory use is safely bounded (see the sketch after this list).

  2. For performance, a more stringent test is needed to inexpensively reject virtually all non-deduped data files, to avoid unnecessarily reading/hashing huge files:

    • A unique "magic number" of sufficient length could be written to the beginning (or end) of a valid deduped file and cheaply checked. This would obviously be a breaking change to the data format, but it would prevent unnecessarily reading/hashing 1 out of every 64 files.
    • Or perhaps there could be some upper-limit on the size of a valid deduped file, implying an upper-limit on the number of blocks that can belong to one file. A limit of 16 million blocks (1 GB deduped) would allow individual file sizes of ~88TB with the default blocksize. If you are dealing with truly huge files, the default block size of 5MB is probably too small, so by increasing that, you should be able to manage this limit far into the future. This limit would eliminate all files > 1GB as possible deduped files without breaking the format itself.
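
For illustration, here is a minimal sketch of what a bounded-memory version of the check could look like, combining the streaming read from point 1 with a hypothetical upper size limit from point 2. The maxDedupedSize constant is an assumption, and plain BLAKE2b-512 is used only as a stand-in for the tree-mode hash s3git actually uses for the root key, so this shows the shape of the fix rather than a drop-in implementation:

// Sketch only: bounded-memory deduped-file check. The 1 GB limit and the
// plain BLAKE2b-512 hash are assumptions; s3git's real root key is a
// tree-mode BLAKE2 hash. The point is io.Copy over a LimitReader instead
// of ioutil.ReadFile.
package dedupcheck

import (
	"bytes"
	"io"
	"os"

	"golang.org/x/crypto/blake2b"
)

const (
	KeySize        = 64                 // one 64-byte key
	maxDedupedSize = 1024 * 1024 * 1024 // hypothetical 1 GB cap (point 2)
)

func isDedupedFile(filename string) (bool, error) {
	f, err := os.Open(filename)
	if err != nil {
		return false, err
	}
	defer f.Close()

	stat, err := f.Stat()
	if err != nil {
		return false, err
	}
	size := stat.Size()

	// Cheap prefilter: at least two keys, a multiple of 64 bytes,
	// and not larger than the assumed maximum deduped size.
	if size < KeySize*2 || size&(KeySize-1) != 0 || size > maxDedupedSize {
		return false, nil
	}

	// Stream everything except the trailing root key into the hash,
	// so memory use stays bounded regardless of file size.
	h, _ := blake2b.New512(nil) // stand-in for the actual root-key hash
	if _, err := io.Copy(h, io.LimitReader(f, size-KeySize)); err != nil {
		return false, err
	}

	rootKey := make([]byte, KeySize)
	if _, err := io.ReadFull(f, rootKey); err != nil {
		return false, err
	}
	return bytes.Equal(h.Sum(nil), rootKey), nil
}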

I'm sure there are other possible solutions I haven't immediately thought of...

Anyway, I consider this to be a serious issue that makes the current "snapshot" function unsuitable for any dataset with large-ish files (say... bigger than 5-10% of available RAM).


vsivsi commented Mar 10, 2017

To reproduce the behavior described above:

mkdir s3git_test
cd s3git_test
s3git init
# Create a 4GB file
head -c $((4*1024*1024*1024)) /dev/zero > test.bin
# Now create a snapshot
/usr/bin/time -l s3git snapshot create . -m 'BOOM'

Under macOS, this requires ~17 GB of memory to complete:

       82.22 real        47.62 user        26.67 sys
17877319680  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
   4536034  page reclaims
        80  page faults
         0  swaps
       130  block input operations
       196  block output operations
         0  messages sent
         0  messages received
         3  signals received
    997446  voluntary context switches
     20409  involuntary context switches

If you modify the test to produce a file that is 4GB-1 in length, you get a very different result:

mkdir s3git_test
cd s3git_test
s3git init
# Create a 4GB - 1 file
head -c $((4*1024*1024*1024-1)) /dev/zero > test.bin
# Now create a snapshot
/usr/bin/time -l s3git snapshot create . -m 'boom?'

In this case it completes using only 23 MB of memory!

       29.54 real        13.48 user         7.51 sys
  23539712  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     15503  page reclaims
         7  page faults
         0  swaps
         2  block input operations
       214  block output operations
         0  messages sent
         0  messages received
         3  signals received
     87732  voluntary context switches
     21964  involuntary context switches

@fwessels (Collaborator)

The current behaviour is a "quick" implementation; obviously a streaming approach is the correct way to do this. I will create an issue for it.

Regarding 2)

  • a magic number could be implemented, but it would break the naming consistency and, more importantly, be incompatible with previous use
  • an upper limit would be fairly easy to implement, as would increasing the default 5MB chunk size to 100MB, or even 500MB or higher. The only caveat is that this imposes a hard limit which you may ultimately run into...

But overall I think the upper limit, in combination with increasing the chunk size, is the best approach (and not difficult to implement); with BLAKE2 hashing speeds of up to 1GB/sec/core this should still be pretty performant.
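
To put rough numbers on that trade-off, here is a small back-of-the-envelope helper (hypothetical, not part of s3git) showing how a cap on the deduped-file size translates into a cap on the original file size for a given chunk size:

package main

import "fmt"

const keySize = 64 // bytes per chunk key in a deduped file

// maxOriginalSize is the largest original file that a deduped file of at
// most maxDeduped bytes can describe with a fixed chunk size.
func maxOriginalSize(maxDeduped, chunkSize int64) int64 {
	return (maxDeduped / keySize) * chunkSize
}

func main() {
	const MB, GB = int64(1) << 20, int64(1) << 30
	// 1 GB deduped cap with the default 5 MB chunks: ~16M chunks, ~88 TB (80 TiB).
	fmt.Println(maxOriginalSize(1*GB, 5*MB))
	// Raising the chunk size to 100 MB pushes the cap to ~1.76 PB (1,600 TiB).
	fmt.Println(maxOriginalSize(1*GB, 100*MB))
}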


vsivsi commented Mar 10, 2017

Hi @fwessels, thanks for your response!

Having spent some time in the code and thinking about this, I'm curious why the "deduped" files need to be as large/complicated as they are.

I'm sure I haven't yet thought of all of the complexities, but isn't the "root hash" alone sufficient to point into the backing store to be able to retrieve the hydrated file?

Other than validating that the deduped file is indeed a deduped file by recreating the root hash, it seems that it would be just as easy to use the root hash to retrieve the chunk hashes from the backing store.

So something like:

s3git snapshot checkout [dir] [commit] --dedup

would write, for every file in the snapshot, a file in the target directory containing only the root hash, plus perhaps some kind of check mechanism to distinguish it from any other 64 byte file (perhaps a hash of the root hash appended?). Then all deduped files would be small, very inexpensive to check, and wouldn't need to be transmitted from the server. They could simply be created from the equivalent of s3git snapshot ls [commit].

Retrieving the hydrated contents of a deduped file would then only need to add the extra step of retrieving the chunk hashes before catting them together (as s3git cat [key] already does).

What am I missing? How does storing/revalidating all of the chunk hashes in the deduped files really help?

Thanks!

fwessels (Collaborator) commented Mar 10, 2017 via email


vsivsi commented Mar 11, 2017

I probably didn't express myself very clearly...

When you clone a snapshot, why are deduped files:

1)

/-----\  /-----\  /-----\  /-----\      /=====\  /=====\
| 0:0 |  | 0:1 |  | 0:2 |  | 0:3 |  ... | 0:N |  | 1:0 |
\-----/  \-----/  \-----/  \-----/      \=====/  \=====/

and not just:

2)

/=====\  /=====\
| 1:0 |  | CHK |
\=====/  \=====/

That way they can be generated and checked in constant time/space.

What advantage is there to dedup format 1) over format 2)?

They seem equivalent in terms of functionality. But 1) is more work and complexity (as this issue demonstrates) when deduped files are never actually hydrated in the clone.


vsivsi commented Mar 11, 2017

To be clear, format 1) would still be used for deduped files in the actual repository (under the root hash key), just not in local clones (when under the original filename).

@fwessels (Collaborator)

Using 1) you can directly get any portion of the file: e.g. if you need (given 5 MB chunks) the range 10-15 MB, you would take the 3rd hash (0:2) and read that chunk out of object storage. Otherwise you would first need to get the (list of) child hashes and only then be able to read the contents.
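
As a small illustration of that random-access point (assuming the fixed default chunk size; this is not code from s3git): with format 1) the leaf key for a byte range can be picked out of the local deduped file by simple integer division, with no round trip for the child-hash list first.

package main

import "fmt"

const chunkSize = 5 * 1024 * 1024 // default 5 MB chunks

// chunkIndex returns which fixed-size chunk contains the given byte offset.
func chunkIndex(offset int64) int64 { return offset / chunkSize }

func main() {
	fmt.Println(chunkIndex(10 * 1024 * 1024)) // 2 -> third leaf key, i.e. 0:2
	fmt.Println(chunkIndex(14 * 1024 * 1024)) // 2 -> still inside the same chunk
}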

As for 2): what would CHK be? A checksum?


vsivsi commented Mar 11, 2017

Right, I guess my thought was that if you want all (or part) of the hydrated data, it is no big deal to use the root hash to retrieve that information from the server (what s3git cat [root] already does?)

Using 1), the chunk hashes are effectively cached locally, saving a round trip request when you want file data. For small files, that is a negligible amount of data, and for large files, it is still small relative to all of the requests you will need to do to retrieve all of the needed chunks.

But 1) comes at the cost of:

  • the initial request and local storage for the chunk hashes (probably not too big a deal given a 5MB -> 128byte compression) and
  • needing to rehash every local deduped file anytime you do a commit (O(n*m) work).

Using 2) every dedup file is the same length, and creating dedup clones and commits is extremely cheap (O(n) work).

CHK could be:

  • a checksum (hash of the root hash)
  • the hash of a magic number + the root hash
  • an n-bit error correcting code of the root hash
  • ???

Basically CHK is something inexpensive (and constant time/memory) to verify that this is a valid, uncorrupted dedup file pointer and not a hydrated file of length 256.
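
As a sketch of how cheap that check could be (purely illustrative: the function names are made up, raw 64-byte keys are assumed, and if keys were stored hex-encoded the lengths would double to the 128/256-byte figures used above), using the hash-of-the-root-hash variant of CHK:

package pointer

import (
	"bytes"
	"errors"
	"os"

	"golang.org/x/crypto/blake2b"
)

const keySize = 64 // one raw BLAKE2b-512 key

// writePointerFile writes the proposed format 2): the root key followed by
// CHK, here simply the BLAKE2b-512 hash of the root key.
func writePointerFile(filename string, rootKey []byte) error {
	chk := blake2b.Sum512(rootKey)
	return os.WriteFile(filename, append(append([]byte{}, rootKey...), chk[:]...), 0o644)
}

// readPointerFile does the constant-time, constant-memory check and returns
// the root key if the file really is a deduped pointer.
func readPointerFile(filename string) ([]byte, error) {
	data, err := os.ReadFile(filename)
	if err != nil {
		return nil, err
	}
	if len(data) != 2*keySize {
		return nil, errors.New("not a deduped pointer file")
	}
	rootKey, chk := data[:keySize], data[keySize:]
	want := blake2b.Sum512(rootKey)
	if !bytes.Equal(chk, want[:]) {
		return nil, errors.New("CHK mismatch: not a deduped pointer file")
	}
	return rootKey, nil
}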

So, 1) makes hydrating deduped files marginally cheaper (it saves one access per file) at the time of hydration, but at the cost of performing an extra lookup per file when cloning, a little extra storage, added complexity, and more expensive clone/commit operations.

Option 2) gives deduped files of all sizes the same storage and performance on clone and commit that single chunk files currently have, but at the cost of an extra disk/network request when hydrated file data is needed.

In my opinion, 2) is preferable because it is simpler and performs better in what seems like the most common case for deduped clones: most files will not be hydrated multiple times. If files are hydrated (on average) less than once per clone, then I believe 2) is better in every respect.

The only case where 1) is clearly better is when files will be hydrated many times each from a single clone, and commits are relatively infrequent.

@fwessels (Collaborator)

Regarding 1)

The amount of data would not be so much the issue, it is more that, for every read access to the object, you would first need to get the hashes for the chunks (since you have no place to store the chunk hashes you would need to do this for every read).

Given huge file sizes, a single GET is probably not always what you want, so making multiple GETs using HTTP range requests makes more sense (and you can do this in parallel, massively speeding up the download -- similar to a multipart upload, but for a download).

The downside is having to rehash the file upon a commit.

Regarding 2)

I can certainly see the usefulness here for certain scenarios, but I would not like the checksum to break the "CAS" nature of the content. However, if the checksum were another BLAKE2B hash (of the root hash) in hierarchy mode, then this would be retained.

And it would be ideal if you could even differentiate this from the format used in scenario 1).

Let me think about that...


vsivsi commented Mar 13, 2017

Over the weekend, I thought of one other issue with the current dedup format. It does not communicate any information about chunk size. That is, if it is valid to chunk files using something other than the current 5MB default (either a different fixed length, or a variable length content defined chunking scheme), then it is not possible to effectively "seek" to the chunk that contains specific data you would like to access. You can't know which chunk it is in without assuming the default fixed chunk size.


vsivsi commented Mar 13, 2017

Also, I should add, if you aren't familiar with them, you should examine how the Dat and Noms projects are addressing very similar issues. They each have somewhat different goals than S3git (and each is more complex for those reasons) but the fundamental underlying architectures are similar, and the functionality of each of those projects significantly overlaps with s3git.

Specifically, each of them structures their data store as a Git-like Merkle tree, each implements Rabin-like content defined chunking, and each uses content addressed storage, with pluggable backends that include S3. They also each provide some kind of "sparse checkout" functionality like the deduped files in s3git.

The Dat project differs from s3git in that it is less focused on providing a "git-like" workflow to its users (it is more Dropbox/GoogleDrive-like), and it is heavily focused on decentralized peer-to-peer sharing of files, rather than cloud-backed repos (although S3 is an optional backend for cloud based peers).

The Noms project is very similar to s3git in that it very much exposes a git-like workflow, and anticipates cloud (and specifically S3-backed) hosting, but it adds to that a rich database-like model for the information being stored which includes the ability to store unstructured binary blobs.

Anyway, I have studied each of these projects in quite a bit of detail, and understand pretty well how they work and why they've made the choices they have. My comments here are informed by that.

However, I'm interested in s3git because it is simpler and the project seems more stable, while still meeting the needs of my immediate project. Dat and Noms are each still developing at a rapid pace, and I can't deploy software that is changing so rapidly without maintaining my own stable fork, which would be a lot of work for such ambitious projects with much larger teams behind them. Anyway, that's my motivation here.

@fwessels (Collaborator)

Regarding the chunk size: it is correct that this has to be a "known" property. In case it ever got lost, you could still figure it out by doing a kind of brute-force search to find the correct root hash.

@fwessels (Collaborator)

I am aware that Dat is more of a peer-to-peer system; Noms is a newer project that I haven't studied much.

Especially with content defined chunking (#20 (comment)) it would make sense (or even be necessary) to be more flexible in the deduped format.

Also, regarding a new deduped format, the BLAKE2 NodeDepth parameter may offer something to take advantage of (see https://github.com/s3git/bt2sum/blob/master/bt2sum.go#L67).

Currently the root hash is computed at NodeDepth = 1. If a CHK were computed over the root hash with NodeDepth = 2 (unused as of yet), you would get a unique hash key, and you would be able to uniquely detect a deduped file.
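
To make the idea concrete, here is a rough sketch using the tree-mode Config of the github.com/dchest/blake2b package that bt2sum builds on. Everything about the Tree parameters other than NodeDepth = 2 is an assumption (they would need to match whatever fanout/depth/leaf-size values s3git standardises on), so treat this as a shape, not a spec:

package chk

import (
	"github.com/dchest/blake2b"
)

// chkForRootKey hashes the 64-byte root key one level up the tree
// (NodeDepth = 2). The result is still a BLAKE2b key, so the CAS property
// is preserved, and it can only be produced from a deduped pointer.
func chkForRootKey(rootKey []byte) ([]byte, error) {
	h, err := blake2b.New(&blake2b.Config{
		Size: 64,
		Tree: &blake2b.Tree{
			// Placeholder values: these would have to match the parameters
			// s3git uses for its leaf (NodeDepth 0) and root (NodeDepth 1)
			// hashes, with only NodeDepth raised to the unused level 2.
			Fanout:        0,
			MaxDepth:      2,
			LeafSize:      0,
			NodeDepth:     2,
			InnerHashSize: 64,
			IsLastNode:    true,
		},
	})
	if err != nil {
		return nil, err
	}
	h.Write(rootKey)
	return h.Sum(nil), nil
}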

What do you think?
