List not filtered properly #29

interlocuteur · 2017-09-08T20:56:46Z

The 258Million list has not been filtered properly. It contains a lot of HTML tags like and .

berzerk0 · 2018-01-30T00:42:59Z

Guess this one slipped by me, do you have a specific example?

It's possible these are legitimately being used as passwords - but that's very unlikely.

interlocuteur · 2018-01-30T20:51:26Z

I don't have the file anymore, but you can search for angle brackets "<" and ">".
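A rough sketch of that search, in Python. The pattern and the `find_tag_like` helper are illustrative assumptions, not part of the project's tooling; the regex only looks for something shaped like an opening or closing tag, so a lone "<" or ">" is not flagged.

```python
import re

# Hypothetical pattern: matches tag-shaped substrings such as <b>, </div>,
# <br/> or <a href="...">. It is a heuristic, not an HTML parser.
TAG_LIKE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(\s[^<>]*)?/?>")


def find_tag_like(lines):
    """Return the lines that contain something shaped like an HTML tag."""
    return [line for line in lines if TAG_LIKE.search(line)]


sample = ["password1", "<b>hello</b>", "a<b", "x > y", "<br/>"]
print(find_tag_like(sample))  # ['<b>hello</b>', '<br/>']
```

For a real wordlist, the same function could be fed the lines of the file with trailing newlines stripped. A match only means the line looks like markup; it does not prove the line was never used as a password.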

berzerk0 · 2018-02-15T23:00:40Z

This is tricky.
I can't be sure of the origin of those lines - they might be both HTML tags and passwords.

berzerk0 · 2018-02-20T23:41:43Z

For Release 2.0, I erred on the side of inclusivity.

There are lines that look a lot like code, specifically HTML tags. The same is true for some email addresses. In many cases, these lines appeared in over 15 files in analysis, suggesting they are in fact passwords. This logic is not definitive, however.
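The "appears in many files" reasoning can be sketched roughly as follows. This is not the analysis code used for the list; `file_counts`, `likely_passwords`, and the shape of the `sources` mapping are assumptions made for illustration. Only the threshold of 15 comes from the comment above.

```python
from collections import Counter

MIN_FILES = 15  # "appeared in over 15 files" is read here as count > 15


def file_counts(sources):
    """Count how many distinct source files contain each line.

    sources: mapping of file name -> iterable of lines (hypothetical layout).
    """
    counts = Counter()
    for lines in sources.values():
        # set() so that repeats within one file count only once
        counts.update(set(lines))
    return counts


def likely_passwords(sources, min_files=MIN_FILES):
    """Return lines seen in more than min_files distinct files."""
    counts = file_counts(sources)
    return {line for line, n in counts.items() if n > min_files}
```

A tag-like line that shows up in many independent source files is harder to explain as a single parsing error, which is the intuition behind the threshold. It remains a heuristic: a shared upstream source could copy the same junk line into many files.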

All of the source files on the list were already published, so this information is already available to the internet. With this in mind, I opted to include these lines. Most questionable lines do not appear until the list is already quite large.

This issue will remain open and we'll meditate upon it.

berzerk0 · 2018-02-22T14:20:54Z

Troy Hunt's take on the problem:

Of course, it's possible people actually used these strings as passwords but applying a bit of Occam's Razor suggests that it's simply parsing issues upstream of this data set.

Frankly though, there's little point in removing a few million junk strings. It reduced the overall data size of [Troy's Pwned Passwords V2] by 0.69% and other than the tiny fraction of extra bytes added to the set, it makes no practical difference to how the data is used.

While it is highly likely that these aren't passwords, the very idea that they are not is based on the assumption that we have a good handle on what passwords are. This assumption, for the most part, is true.

However, INTENTIONALLY making passwords that don't look like passwords isn't without merit. I once worked at a company where we had reason to believe that keyloggers were installed on our systems. I had no idea what to do with this information, but it really bothered me. To cope with this, I came up with an idea to use the on-screen keyboard to create a password that looked like a URL.

Certainly, I can't be the only one to come up with the idea of making a password that contains some sort of camouflage. It is still far more likely that these are simple "upstream parsing" issues, but including them has such a small impact on list performance that I say they are worth keeping.

berzerk0 self-assigned this Jan 30, 2018

berzerk0 added the wontfix label Mar 7, 2019
