Creating a test set with a hash (Issue 71 was closed) #613

Open
minertom opened this issue Jan 5, 2021 · 1 comment
Comments

@minertom

minertom commented Jan 5, 2021

Hi, I did read issue #71 "Creating test set with hash" and I only had one question concerning your explanation.

During the hashing, only the last byte of the hash is used to decide whether the data in question belongs to the test set. Yes, the whole hash is a unique value (unless a collision happens), but only the last byte (0-255) determines membership. So, are you saying that because the hashing algorithm provides a "uniform distribution", about 20% of the last-byte values will be less than 51 (roughly 20% of 256)?

Thank You
Tom

BTW, I purchased your book. Love it so far.

@ageron
Owner

ageron commented May 4, 2021

Hi @minertom,

Thanks for your question, and for your kind words (I'm very glad you enjoy my book!).

You guessed right: I'm assuming that the last byte of the hash follows a uniform distribution over all 256 possible byte values, so about 20% of them will be lower than 51, since 51/256 is about 20%. Note that 51/256 = 19.92%, while 52/256 = 20.31%, so there's no way to get precisely 20% with just one byte. If this granularity is not sufficient, you could convert the whole hash to a very large integer and check whether it's smaller than 20% of the maximum possible value. I felt that the added complexity wasn't worth the effort, but as this code has confused quite a few readers, I'm not sure that was a good call.
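
To make the two options concrete, here is a minimal sketch rather than the book's exact code. It assumes SHA-256 from Python's `hashlib` and integer identifiers, neither of which is specified in this thread, and the function names are hypothetical.

```python
import hashlib


def _digest(identifier: int) -> bytes:
    # Assumption: identifiers are integers that fit in 8 signed bytes,
    # and SHA-256 is the hash (any well-mixed hash would do).
    return hashlib.sha256(identifier.to_bytes(8, "little", signed=True)).digest()


def in_test_set_last_byte(identifier: int, test_ratio: float = 0.2) -> bool:
    # Keep the instance if the last byte is below the threshold.
    # For test_ratio=0.2 the threshold is int(51.2) = 51, so bytes 0..50
    # pass: 51 of 256 values, i.e. 51/256 of a uniform distribution.
    threshold = int(256 * test_ratio)
    return _digest(identifier)[-1] < threshold


def in_test_set_full_hash(identifier: int, test_ratio: float = 0.2) -> bool:
    # Finer granularity: read the whole digest as one large integer and
    # compare it with test_ratio times the number of possible values.
    digest = _digest(identifier)
    value = int.from_bytes(digest, "big")
    n_values = 2 ** (8 * len(digest))
    return value < test_ratio * n_values


if __name__ == "__main__":
    ids = range(100_000)
    frac_byte = sum(in_test_set_last_byte(i) for i in ids) / len(ids)
    frac_full = sum(in_test_set_full_hash(i) for i in ids) / len(ids)
    print(f"last byte: {frac_byte:.4f}, full hash: {frac_full:.4f}")
```

Both checks are deterministic, so a given identifier always lands on the same side of the split. The only difference is how closely the expected test fraction can match the requested ratio.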

Anyway, I hope it's clearer now?
