FEA Add Information Gain and Information Gain Ratio feature selection functions #28905
Conversation
Thanks for the PR @StefanieSenger. Would it make sense to add a test which compares the transformed X between the information gain and information gain ratio, since they should be generally the same?
I have added such a test @OmarManzoor, maybe it helps if one day someone works on the
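For illustration only (this is not the PR's actual test, and the imports assume the new functions land in sklearn.feature_selection as proposed), a test along the lines suggested above might look like:

import numpy as np
from numpy.testing import assert_array_equal
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio


def test_info_gain_and_info_gain_ratio_select_same_features():
    # both criteria should typically rank features the same way, so the
    # transformed X from SelectKBest should match on a simple dataset
    X, y = make_classification(n_samples=100, n_features=10, random_state=0)
    X = np.abs(X)  # keep feature values non-negative, as for chi2-style scores
    X_ig = SelectKBest(info_gain, k=3).fit_transform(X, y)
    X_igr = SelectKBest(info_gain_ratio, k=3).fit_transform(X, y)
    assert_array_equal(X_ig, X_igr)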
A few minor suggestions, otherwise this looks good. Thanks @StefanieSenger.
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Nice, thank you @OmarManzoor
Just a couple of initial comments: use scipy instead of our own implementation of the entropy or the KL divergence.
* For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
  :func:`f_classif`, :func:`mutual_info_classif`
Suggested change:
- * For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
-   :func:`f_classif`, :func:`mutual_info_classif`
+ * For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
+   :func:`f_classif`, :func:`mutual_info_classif`.
You can also add a full stop on the line before.
- |Feature| :func:`~feature_selection.info_gain` and
  :func:`~feature_selection.info_gain_ratio` can now be used for
  univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
Suggested change:
- univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
+ univariate feature selection.
+ :pr:`28905` by :user:`Viktor Pekar <vpekar>` and
+ :user:`Stefanie Senger <StefanieSenger>`.
def _get_entropy(prob):
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)
Nowadays, I think this is implemented in scipy.stats.entropy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html. The base here is set to 2 (I have to check if it makes sense or not).
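As a quick sanity check of that suggestion (just a sketch, reusing the helper from the diff above): scipy.stats.entropy sums the elementwise -p * log(p) terms, so the comparison is against the sum of _get_entropy's output.

import numpy as np
from scipy.stats import entropy


def _get_entropy(prob):  # helper as written in the diff above
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)


prob = np.array([0.5, 0.25, 0.25])
# elementwise terms summed vs. scipy's entropy with base=2
assert np.isclose(_get_entropy(prob).sum(), entropy(prob, base=2))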
def _a_log_a_div_b(a, b):
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
    t[~np.isfinite(t)] = 0
    return np.multiply(a, t)
Supposedly this could be replaced by rel_entr from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.rel_entr.html#scipy.special.rel_entr. The difference is that we use log2 instead of the log base e in the scipy definition. I have to check.
So I assume that we could use the natural logarithm everywhere, because it would only differ by a constant multiplier, and since we are only comparing the information scores it should not matter.
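A tiny illustration of that argument (made-up scores, not from the PR): rescaling all scores by the constant factor between log bases does not change how features are ranked, so any top-k selection is unaffected.

import numpy as np

scores_bits = np.array([0.12, 0.45, 0.03, 0.30])  # hypothetical scores in bits
scores_nats = scores_bits * np.log(2)             # the same scores in nats

# the ordering of features, and hence any top-k selection, is identical
assert np.array_equal(np.argsort(scores_bits), np.argsort(scores_nats))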
c_prob = c_count / c_count.sum()
fc_prob = fc_count / total

c_f = _a_log_a_div_b(fc_prob, c_prob * f_prob)
To give an example regarding the base, here it would be equivalent to:
c_f = rel_entr(fc_prob, c_prob * f_prob) / np.log(2)
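A minimal numerical check of this equivalence (toy probabilities standing in for the PR's arrays): rel_entr(a, b) computes a * log(a / b) with the natural log, so dividing by np.log(2) recovers the base-2 version used in _a_log_a_div_b.

import numpy as np
from scipy.special import rel_entr

fc_prob = np.array([0.2, 0.1, 0.3])  # toy joint probabilities
c_prob = np.array([0.5, 0.2, 0.6])   # toy class probabilities
f_prob = np.array([0.5, 0.4, 0.7])   # toy feature probabilities

manual = fc_prob * np.log2(fc_prob / (c_prob * f_prob))
via_scipy = rel_entr(fc_prob, c_prob * f_prob) / np.log(2)
assert np.allclose(manual, via_scipy)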
@@ -0,0 +1,115 @@
"""
We will probably avoid adding a new example and instead edit an existing one.
"""Count feature, class, joint and total frequencies | ||
|
||
Returns | ||
------- | ||
f_count : array, shape = (n_features,) | ||
c_count : array, shape = (n_classes,) | ||
fc_count : array, shape = (n_features, n_classes) | ||
total: int | ||
""" |
We will need a proper docstring following our new standards.
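Purely as an illustration of the shape such a docstring could take (the descriptions below are guesses from the diff, not the authors' wording):

def _get_fc_counts(X, y):
    """Count feature, class, joint feature-class and total frequencies.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Sample vectors.

    y : array-like of shape (n_samples,)
        Target vector (class labels).

    Returns
    -------
    f_count : ndarray of shape (n_features,)
        Per-feature occurrence counts.

    c_count : ndarray of shape (n_classes,)
        Per-class sample counts.

    fc_count : ndarray of shape (n_features, n_classes)
        Joint feature-class occurrence counts.

    total : int
        Total number of samples.
    """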
    return np.asarray(scores).reshape(-1)


def _get_fc_counts(X, y):
Since this is called a single time, we should not need to have a function.
with np.errstate(invalid="ignore", divide="ignore"):
    scores = scores / (_get_entropy(c_prob) + _get_entropy(1 - c_prob))

# the feature score is averaged over classes
I think the comment only applies to the first case.
c_nf = _a_log_a_div_b((c_count - fc_count) / total, c_prob * (1 - f_prob))
nc_f = _a_log_a_div_b((f_count - fc_count) / total, (1 - c_prob) * f_prob)

scores = c_f + nc_nf + c_nf + nc_f
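As a sketch of why there are four terms (not from the PR): they are the decomposition of the mutual information between a binary feature-presence indicator and a one-vs-rest class indicator, which can be checked numerically against sklearn.metrics.mutual_info_score converted to bits.

import numpy as np
from scipy.special import rel_entr
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1000)    # binary feature indicator (toy data)
y_c = rng.integers(0, 2, size=1000)  # one-vs-rest class indicator (toy data)

# 2x2 joint distribution and its marginals
joint = np.array([[np.mean((x == i) & (y_c == j)) for j in (0, 1)] for i in (0, 1)])
p_x = joint.sum(axis=1, keepdims=True)
p_y = joint.sum(axis=0, keepdims=True)

four_terms = rel_entr(joint, p_x * p_y).sum() / np.log(2)  # sum of the four terms, in bits
reference = mutual_info_score(x, y_c) / np.log(2)          # nats -> bits
assert np.isclose(four_terms, reference)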
I think that I would prefer `_info_gain` to return this score, have the ratio below done in `info_gain_ratio`, and finally have a function that could be called twice to just make the reduction.
def _info_gain(X, y):
    # probably the name of the function should be better
    ...
    return scores, c_prob


def info_gain(X, y, aggregate=np.max):
    return aggregate(_info_gain(X, y)[0], axis=0)


def info_gain_ratio(X, y, aggregate=np.max):
    # entropy here is scipy.stats.entropy, as discussed above
    scores, c_prob = _info_gain(X, y)
    with np.errstate(invalid="ignore", divide="ignore"):
        scores /= entropy(c_prob) + entropy(1 - c_prob)
    return aggregate(scores, axis=0)
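As a usage sketch (assuming info_gain and info_gain_ratio end up exposed from sklearn.feature_selection as this PR proposes), either score function would plug into the existing univariate selectors:

import numpy as np
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio

# toy non-negative count data, as one would have for text features
X = np.array([[1, 0, 3],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 2]])
y = np.array([0, 1, 0, 1])

X_ig = SelectKBest(score_func=info_gain, k=2).fit_transform(X, y)
X_igr = SelectKBest(score_func=info_gain_ratio, k=2).fit_transform(X, y)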
Reference Issues/PRs
closes #6534
What does this implement/fix? Explain your changes.
This 2016 PR intended to add `info_gain` and `info_gain_ratio` functions for univariate feature selection. Here, I update and finish it up. For further information, please refer to the discussion on the old PR.