Full database size #46

Open
cMadan opened this issue Oct 28, 2018 · 3 comments

Comments

@cMadan

cMadan commented Oct 28, 2018

I'm doing some analyses based on the appearances data now added, but two specific numbers would be helpful in characterizing the full dataset that these top X appearances are then extracted from.

(1) How many unique passwords (i.e., >=1 appearance) were present in the full database? I.e., the "nearly 13 billion" value, but I would appreciate the specific number.

(2) What is the total number of password appearances in the full database, i.e., the sum of the appearances column across all nearly 13 billion passwords?

@berzerk0
Owner

berzerk0 commented Mar 8, 2019

I misread this: I was thinking "appearances > 1", not ">= 1".
Those numbers are coming in the next few days.
In the meantime...

  • Unique passwords with 2 or more appearances: 8,168,893,389
    8 billion 168 million 893 thousand 389

The sum of appearances was overflowing awk's 32-bit integer, so I'll have to get creative there.
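
One possible way around the overflow mentioned above is to do the sum in a language with arbitrary-precision integers. The following is only a minimal sketch, not the method used for this project: the function name `summarize_counts` is hypothetical, the sample counts are made up, and it assumes each line has the form "<count> <password>" separated by whitespace, as in the entries quoted in this thread.

```python
from typing import Iterable, Tuple


def summarize_counts(lines: Iterable[str], min_appearances: int = 1) -> Tuple[int, int]:
    """Return (unique_passwords, total_appearances) for entries at or above a threshold.

    Assumption: each line looks like "<count> <password>", e.g. "1 åberopbart".
    """
    unique = 0
    total = 0  # Python ints are arbitrary precision, so the sum cannot overflow
    for line in lines:
        parts = line.split(None, 1)  # split off the leading count only
        if not parts:
            continue  # skip blank lines
        count = int(parts[0])
        if count >= min_appearances:
            unique += 1
            total += count
    return unique, total


# Made-up sample data; the first two counts together exceed 2**32.
sample = ["3000000000 123456", "2000000000 password", "1 åberopbart"]
print(summarize_counts(sample))     # every entry (>= 1 appearance)
print(summarize_counts(sample, 2))  # entries with 2 or more appearances
```

Passing an open file object instead of a list would stream the data line by line, which matters for a list with billions of rows.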

I will put these numbers together, but there is a serious caveat to be aware of.
Since these lists are hard to manicure, quality control drops off as appearance values get lower.

Values that appeared fewer than 5 times are far more likely to be garbage data that somehow got into a wordlist as everything got passed around the internet before it ended up in my hands. There is a very high chance that values that appear 1 time are not "passwords" but are, for example, values from a dictionary.

Here we have 10 entries that appear about 1000 lines above the bottom of the list.

  • 1 åberopbart
  • 1 åberopbara
  • 1 åberopbar
  • 1 åberopats
  • 1 åberopat
  • 1 åberopas
  • 1 åberopar
  • 1 åberopandets
  • 1 åberopandet
  • 1 åberopande

These are all dictionary words. Yes, they may be used as passwords, but it is highly likely that one of the large, encyclopedic wordlists contained entire dictionaries. The goal of this project is to include passwords that are common, not to build a large encyclopedic list. This is why I set the cutoff for inclusion to 5 appearances or higher.
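
The cutoff described above can be sketched as a simple filter. This is an illustration under assumptions, not the project's actual tooling: `filter_common` and `MIN_APPEARANCES` are hypothetical names, and the sample entries other than the dictionary word quoted above are made up.

```python
from typing import Iterable, Iterator, Tuple

MIN_APPEARANCES = 5  # inclusion cutoff stated in this comment


def filter_common(
    entries: Iterable[Tuple[int, str]],
    min_appearances: int = MIN_APPEARANCES,
) -> Iterator[Tuple[int, str]]:
    """Yield only (count, password) pairs at or above the cutoff."""
    for count, password in entries:
        if count >= min_appearances:
            yield count, password


# Made-up counts for illustration only.
entries = [(120, "hunter2"), (5, "letmein1"), (4, "qwerty99"), (1, "åberopbart")]
kept = list(filter_common(entries))
print(kept)
```

With a cutoff of 5, an entry with exactly 5 appearances is kept and one with 4 is dropped, so single-appearance dictionary words never reach the published list.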

Understand that analysis of "passwords" becomes VERY dubious at low appearance counts.

@cMadan
Author

cMadan commented Mar 15, 2019

Thanks for the detailed follow-up! My intention definitely isn't to use passwords with fewer than 5 occurrences in any sort of analysis; it's more to characterise the size of the database that my analyses are derived from. I hope that makes more sense now :).

@berzerk0
Owner

I'll figure out how to put those files together for you soon.
Right now I am very focused on a certification and will get to it after my exam - apologies.
