transform_feature_names for decomposition and scaling #208

Open · wants to merge 8 commits into master

Conversation

jnothman (Contributor)

Present top linear weights as PCA2:=(x0*5.3+x1*77) etc.
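To make the notation concrete, here is a minimal standalone sketch that builds such names straight from a fitted PCA's components_ (illustrative only - it does not use this PR's code, and the input names and top cut-off are made up):

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on toy data; input feature names are invented for the example.
X = np.random.RandomState(0).rand(100, 5)
pca = PCA(n_components=3).fit(X)
in_names = ['x%d' % i for i in range(X.shape[1])]

top = 3  # keep only the 3 largest-magnitude weights per component
for i, component in enumerate(pca.components_):
    keep = np.argsort(-np.abs(component))[:top]
    terms = '+'.join('%s*%.3g' % (in_names[j], component[j]) for j in keep)
    print('PCA%d:=(%s)' % (i, terms))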

Issues:

  • others may feel this naming convention is too verbose; suggestions welcome.
  • make_tfn_weighted should probably have a doctest, but I'm not sure how to write one without modifying the global register of transform_feature_names implementations (see the sketch after this list).
  • tests may be too slow, particularly those using hypothesis; suggestions welcome.
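On the doctest question, a hedged idea: doctest the callable returned by make_tfn_weighted directly, without registering it. The arguments below mirror the TfidfTransformer registration in this PR; calling the result as (est, in_names), per the transform_feature_names protocol, is an assumption.

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tfn = make_tfn_weighted('idf_', func_name='TFIDF', show_idx=False)
>>> est = TfidfTransformer().fit([[1, 0], [1, 1]])
>>> tfn(est, ['x0', 'x1'])  # doctest: +SKIP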

jnothman (Contributor, Author)

In merging with master, I see that there has been an attempt to test every estimator with transform_feature_names more thoroughly. Where multiple estimators are registered identically, I believe this is akin to testing every value in a for loop, and is excessive. I would rather keep the tests readable.

codecov-io commented May 23, 2017

Codecov Report

Merging #208 into master will decrease coverage by 11.19%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #208      +/-   ##
=========================================
- Coverage    97.4%   86.2%   -11.2%     
=========================================
  Files          41      41              
  Lines        2580    2617      +37     
  Branches      495     505      +10     
=========================================
- Hits         2513    2256     -257     
- Misses         35     329     +294     
  Partials       32      32
Impacted Files | Coverage Δ
eli5/sklearn/transform.py | 100% <100%> (ø) ⬆️
eli5/xgboost.py | 4.09% <0%> (-95.33%) ⬇️
eli5/lightgbm.py | 4.2% <0%> (-94.98%) ⬇️
eli5/_decision_path.py | 43.47% <0%> (-56.53%) ⬇️
eli5/_feature_names.py | 94.33% <0%> (-2.84%) ⬇️
eli5/sklearn/utils.py | 88.05% <0%> (-1.8%) ⬇️
eli5/sklearn/explain_prediction.py | 97.74% <0%> (-0.13%) ⬇️
eli5/sklearn/explain_weights.py | 100% <0%> (ø) ⬆️
eli5/formatters/html.py | 100% <0%> (ø) ⬆️

kmike (Contributor) commented May 24, 2017

Hey @jnothman! I'll check this PR later today.

As for excessive testing, I like testing all (or most) estimators explicitly, for two reasons:

  1. this way we declare explicitly what's supported and what's not;
  2. it helps to find bugs and wrong assumptions.

For example, RandomizedLogisticRegression has SelectorMixin as a metaclass, not as a base class, so #177 didn't work for it; I found this when adding all these estimators (which I thought were supported) one by one. Another example is lightning support: when adding all estimators one by one, I found several inconsistencies which required fixes in lightning: scikit-learn-contrib/lightning#95, scikit-learn-contrib/lightning#98.

I can see how, from a library author's (sklearn, etc.) point of view, these tests look unnecessary. But skipping them requires knowledge of the inner workings of the library, and of its development plans - e.g. SelectorMixin is not documented and is not part of the public scikit-learn API; as a scikit-learn core developer you are comfortable relying on it, but as an outsider I prefer to be explicit about what eli5 supports, and to check one more time that the library interface is consistent and there are no gotchas or hidden issues. In the case of #177 it was also a learning exercise for me: I was checking what the PR means for the user.

jnothman (Contributor, Author) commented May 24, 2017 via email

kmike (Contributor) commented May 24, 2017

Good catch about randomized l1!

Yeah, deprecations can be a problem. We "solved" it once by dropping sklearn 0.17 support, and it'll likely cause problems again with 0.19. But to my taste it is a better problem to have: it is testing-only, and it is visible; if we don't test, the problem can show up in user applications and go undetected by our test suite. But yeah, explicit estimator support does increase the maintenance effort.

jnothman (Contributor, Author)

I've not worked out how to fix mypy's complaint that

eli5/sklearn/transform.py:176: error: Argument 1 has incompatible type "make_tfn_weighted"; expected Callable[..., Any]

kmike (Contributor) commented May 24, 2017

@jnothman yep, mypy still has a lot of rough edges - python/mypy#797. You can ignore such errors like this:

transform_feature_names.register(TfidfTransformer)(  # type: ignore
    make_tfn_weighted('idf_', func_name='TFIDF', show_idx=False))

kmike (Contributor) commented May 29, 2017

Hey @jnothman,

I've tried it with LSI:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

vec = CountVectorizer()
tfidf = TfidfTransformer()
svd = TruncatedSVD(n_components=50)
clf = LogisticRegressionCV()
pipe = make_pipeline(vec, tfidf, svd, clf)

eli5.explain_weights shows features like this now:

SVD2:=(TFIDF(keith*3.75)*0.362+TFIDF(caltech*3.99)*0.27+TFIDF(livesey*4.55)*0.223)

Unlike TfidfVectorizer, it shows idf weights, which is nice; I wonder if we should use idf_ as coef_scale by default for TfidfVectorizer.
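For context, coef_scale is the existing eli5 option referenced here; a hedged sketch of what passing idf_ explicitly might look like (toy corpus, and the exact call is an assumption based on explain_weights accepting coef_scale for linear models):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
import eli5

docs = ['hello world', 'foo bar', 'hello bar', 'foo world', 'hello foo', 'bar world']
y = [0, 1, 0, 1, 0, 1]
tvec = TfidfVectorizer()
clf = LogisticRegressionCV(cv=2).fit(tvec.fit_transform(docs), y)  # tiny data, so 2-fold CV

# Scale each learned coefficient by the corresponding IDF weight.
eli5.explain_weights(clf, vec=tvec, coef_scale=tvec.idf_)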

What I found confusing is the TFIDF(keith*3.75) notation: it reads like "take the count of the word "keith" in the document, multiply it by 3.75, then compute TF*IDF of the result (?)", while in fact 3.75 is the IDF, so it is TF(keith)*3.75. If the goal is to show the transformer name rather than a math formula, then what do you think about using some other notation, e.g. TFIDF[...] or something else? Ideas?

jnothman (Contributor, Author) commented May 29, 2017 via email

kmike (Contributor) commented Jun 21, 2017

FTR: tests are failing because of #217.

jnothman (Contributor, Author)

I'd really like to hear some feedback on this: I feel a linear transformation is a very common case, but it is admittedly hard to present nicely as a string.

jnothman (Contributor, Author)

OTOH, if you feel like it's stretching the scope of eli5 too wide, I'm happy to make a separate project of it.

kmike (Contributor) commented Jun 29, 2017

@jnothman this PR looks great; I'm looking forward to using it myself. My only reservation is the TF(...) notation, which looks like a function being applied. It seems the easiest way to fix it is to make the formatting in make_tfn_weighted customizable, and customize it for TfidfTransformer.

jnothman (Contributor, Author) commented Jun 29, 2017 via email

kmike (Contributor) commented Jun 29, 2017

@jnothman yes, sounds fine.

I tried <> as well as a few other options ({}, []); they all have disadvantages - {} looks like set notation, [] looks like we're showing a vector, and <> may be confused with greater-than/less-than signs. But <> looks the least ambiguous; it is no worse than the alternatives:

SVD2:=<TFIDF<keith*3.75>*0.362+TFIDF<caltech*3.99>*0.27+TFIDF<livesey*4.55>*0.223>
SVD2:={TFIDF{keith*3.75}*0.362+TFIDF{caltech*3.99}*0.27+TFIDF{livesey*4.55}*0.223}
SVD2:=[TFIDF[keith*3.75]*0.362+TFIDF[caltech*3.99]*0.27+TFIDF[livesey*4.55]*0.223]
SVD2:=(TFIDF(keith*3.75)*0.362+TFIDF(caltech*3.99)*0.27+TFIDF(livesey*4.55)*0.223)

PCA2:={x0*5.3+x1*77}
PCA2:=<x0*5.3+x1*77>
PCA2:=[x0*5.3+x1*77]
PCA2:=(x0*5.3+x1*77)

Maybe spaces around + can help a bit with readability - what do you think?

SVD2:=<TFIDF<keith*3.75>*0.362 + TFIDF<caltech*3.99>*0.27 + TFIDF<livesey*4.55>*0.223>
PCA2:=<x0*5.3 + x1*77>

kmike (Contributor) commented Jun 29, 2017

Does it make sense to keep () for the := notation?

SVD2:=(TFIDF<keith*3.75>*0.362 + TFIDF<caltech*3.99>*0.27 + TFIDF<livesey*4.55>*0.223)
PCA2:=(x0*5.3 + x1*77)

Also, should we indicate that some terms are missing (if they are missing), e.g. PCA2:=(x0*5.3 + x1*77 + …)?

jnothman (Contributor, Author) commented Jun 29, 2017 via email

kmike (Contributor) commented Jun 30, 2017

What do you think about the following?

  1. add spaces around + signs;
  2. add '...' if there are missing terms in a sum, e.g. PCA2:=(x0*5.3 + x1*77 + ...);
  3. keep () braces for everything except for tf*idf;
  4. change tf*idf formatting. Options: TFIDF/keith*3.75/, TF(keith)*3.75, TF<keith*3.5>, TFIDF[keith*3.75], TFIDF:(keith*3.75).

(4) is what I'm least certain about. Maybe we should just pick one of the options, even if it is not perfect, to move this forward; it can be fixed later if someone comes up with a better idea.
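For the record, not this PR's API - just a standalone sketch of what (1)-(3), plus a customizable per-term format for (4), could look like (all names below are made up):

def format_weighted(name, terms, n_total, term_fmt='{feature}*{weight:.3g}'):
    # terms: (feature, weight) pairs that survived top/threshold filtering
    body = ' + '.join(term_fmt.format(feature=f, weight=w) for f, w in terms)  # (1)
    if len(terms) < n_total:
        body += ' + ...'  # (2): mark dropped terms
    return '{}:=({})'.format(name, body)  # (3): keep () around the sum

print(format_weighted('PCA2', [('x0', 5.3), ('x1', 77)], n_total=5))
# PCA2:=(x0*5.3 + x1*77 + ...)
print(format_weighted('SVD2', [('keith', 1.36), ('caltech', 1.08)], n_total=50,
                      term_fmt='TFIDF[{feature}]*{weight:.3g}'))
# SVD2:=(TFIDF[keith]*1.36 + TFIDF[caltech]*1.08 + ...)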

jnothman (Contributor, Author) commented Jul 9, 2017

TF(keith)*3.75 would make sense... But maybe it belongs in a different function, and I've loaded too much onto this one.

jnothman (Contributor, Author) commented Jul 9, 2017

With the ... idea, do you want to see the ellipsis only when top is the reason for limiting results, or also when threshold is limiting them?

jnothman (Contributor, Author) commented Jul 9, 2017

I've not put in an ellipsis previously because I considered brevity a virtue in feature names.

kmike (Contributor) commented Jul 9, 2017

@jnothman Hm, I was thinking about both top and threshold. We shouldn't show "+ …" if the rest of the features are zero, though; and if you think of threshold as a way to filter out near-zero components, I can see how it differs from top.

PCA2:=(x0*0.325 + x1*0.221 + …) is a bit longer, but e.g. for

SVD2:=(TFIDF(keith)*3.75*0.362 + TFIDF(caltech)*3.99*0.27 + TFIDF(livesey)*4.55*0.223 + …)

it doesn't matter.

It bothers me a bit that these are display options, so they'd be better handled at the formatting level. For example, the remaining SVD terms could be displayed on mouse hover in HTML. But that looks outside the scope of this pull request; it may require a structured representation of feature names, instead of plain strings everywhere.

We had this issue with HashingVectorizer, where we needed to store the signs of possible feature names in addition to the names themselves; that's why formatters accept feature names that are lists of {'name': name, 'sign': sign} dicts.
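For illustration, the structured form described above could look like this (values made up):

feature_name = [
    {'name': 'keith', 'sign': 1},
    {'name': 'caltech', 'sign': -1},  # colliding term hashed with a flipped sign
]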
