Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need help understanding crc hash used to explain test train split in Chapter 2 #655

Open
paliwalvarad opened this issue May 1, 2022 · 1 comment

Comments

@paliwalvarad
Copy link

paliwalvarad commented May 1, 2022

Hi there, in the following code, I fail to understand how the hash function is working here and how would it make sure that the test set is randomly selected. Can you shed some light on this please? Thank you.

def test_set_check(identifier,test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio *2**32)
@ageron
Copy link
Owner

ageron commented May 5, 2022

Hi @paliwalvarad ,

Sure. The CRC-32 algorithm takes the identifier and converts it to a (somewhat) pseudo-random 32-bit number. In other words, we're using the crc32() function as a hash function: the same identifier will always produce the same CRC-32 "hash", but two different identifiers will generally produce very different (seemingly random) hashes.

The & 0xffffffff can be removed: it doesn't hurt, but it was only needed in Python 2 to convert the output of crc32() from a signed 32-bit integer to an unsigned 32-bit integer.

So now we have a 32-bit signed integer. We check whether it's smaller than test_ratio * 2**32. The largest possible CRC-32 value is 0xffffffff, which is 232 - 1. So if test_ratio is equal to, say, 0.2, then the pseudo-random CRC-32 will only be smaller than 232 about 20% of the time, on average over all possible identifiers.

In short, this function takes an identifier and a test_ratio, and it computes a CRC-32 of the identifier to determine whether or not it belongs to the test set. Each identifier has a probability equal to the test_ratio of being assigned to the test set.

Since each identifier is assigned randomly with probability test_ratio, it is possible that the actual test set size will not be exactly equal to the test_ratio, especially for small datasets. If that's a problem, then you can choose to use the other approaches, such as shuffling the dataset and using the first 20% for testing (for example).

Hope this helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants