List not filtered properly #29

Open
interlocuteur opened this issue Sep 8, 2017 · 5 comments

@interlocuteur

The 258Million list has not been filtered properly. It contains a lot of HTML tags.

@berzerk0
Owner

Guess this one slipped by me. Do you have a specific example?

It's possible these are legitimately being used as passwords - but that's very unlikely.

@berzerk0 berzerk0 self-assigned this Jan 30, 2018
@interlocuteur
Author

I don't have the file anymore, but you can search for angle brackets "<" and ">".
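A minimal sketch of that search, for anyone who wants to reproduce it. The regular expression and the `scan_file` helper are assumptions for illustration, not part of the project's tooling, and "tag-like" here only means an opening bracket, an optional slash, a letter, and a later closing bracket.

```python
import re
from typing import Iterable, Iterator

# Assumption: a "tag-like" line contains "<", an optional "/", a letter,
# and a later ">". This is a heuristic, not an HTML parser.
TAG_LIKE = re.compile(r"</?[A-Za-z][^<>]*>")


def find_tag_like(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line that contains something shaped like an HTML tag."""
    for line in lines:
        candidate = line.rstrip("\r\n")
        if TAG_LIKE.search(candidate):
            yield candidate


def scan_file(path: str) -> Iterator[str]:
    """Hypothetical helper: stream a wordlist from disk, one entry per line."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        yield from find_tag_like(handle)


# Small in-memory sample; real wordlists would go through scan_file().
sample = ["password1", "<br>hunter2", "a<b", "</div>", "x>y<z", "<3forever>"]
flagged = list(find_tag_like(sample))
print(flagged)
```

Lines that merely contain a stray "<" or ">" are not flagged, which keeps entries such as "<3forever>" out of the results.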

@berzerk0
Owner

This is tricky.
I can't be sure of the origin of those lines - they might be both HTML tags and passwords.

@berzerk0
Owner

For Release 2.0, I erred on the side of inclusivity.

There are lines that look a lot like code, specifically HTML tags. The same is true for some email addresses. In many cases, these lines appeared in over 15 files in analysis, suggesting they are in fact passwords. This logic is not definitive, however.
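The "appeared in over 15 files" check can be sketched roughly as follows. This is an illustration under stated assumptions, not the analysis code used for the release: the function names, the in-memory `demo` data, and the demo threshold of 1 are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def file_counts(files: Dict[str, Iterable[str]]) -> Counter:
    """Count, for each line, how many distinct files contain it."""
    counts: Counter = Counter()
    for lines in files.values():
        # A set ensures each file contributes at most one count per line,
        # even if the line is repeated inside that file.
        unique = {line.rstrip("\r\n") for line in lines}
        counts.update(unique)
    return counts


def widely_seen(counts: Counter, threshold: int) -> List[str]:
    """Return lines that appear in more than `threshold` files, sorted."""
    return sorted(line for line, n in counts.items() if n > threshold)


# Hypothetical stand-in for the source files; file names are made up.
demo = {
    "a.txt": ["<b>", "hunter2", "<b>"],
    "b.txt": ["<b>", "letmein"],
    "c.txt": ["<b>", "hunter2"],
}
counts = file_counts(demo)
kept = widely_seen(counts, threshold=1)  # the comment above describes "over 15"
print(kept)
```

A line that shows up in many independent sources is more plausibly a real password than a one-off parsing artifact, though as noted this is not definitive.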

All of the source files on the list were already published, so this information is already available to the internet. With this in mind, I opted to include these lines. Most questionable lines do not appear until the list is already quite large.

This issue will remain open and we'll meditate upon it.

@berzerk0
Owner

berzerk0 commented Feb 22, 2018

Troy Hunt's take on the problem:

"Of course, it's possible people actually used these strings as passwords but applying a bit of Occam's Razor suggests that it's simply parsing issues upstream of this data set."

"Frankly though, there's little point in removing a few million junk strings. It reduced the overall data size of [Troy's Pwned Passwords V2] by 0.69% and other than the tiny fraction of extra bytes added to the set, it makes no practical difference to how the data is used."

While it is highly likely that these aren't passwords, the very idea that they are not is based on the assumption that we have a good handle on what passwords are. This assumption, for the most part, is true.

However, INTENTIONALLY making passwords that don't look like passwords isn't without merit. I once worked at a company where we had reason to believe that keyloggers were installed on our systems. I had no idea what to do with this information, but it really bothered me. To cope with this, I came up with an idea to use the on-screen keyboard to create a password that looked like a URL.

Certainly, I can't be the only one to come up with the idea of making a password that contains some sort of camouflage. It is still far more likely that these are simple "upstream parsing" issues, but including them has such a small impact on list performance that I say they are worth keeping.
