
FEA Add Information Gain and Information Gain Ratio feature selection functions #28905

Open
wants to merge 48 commits into base: main

Conversation

StefanieSenger
Contributor

Reference Issues/PRs

closes #6534

What does this implement/fix? Explain your changes.

This 2016 PR intended to add info_gain and info_gain_ratio functions for univariate feature selection. Here, I update and finish it up. For further information, please refer to the discussion on the old PR.
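For illustration, a minimal sketch of how the new scorers could be plugged into SelectKBest (assuming the functions are importable from sklearn.feature_selection as proposed here, and that, like chi2, they expect non-negative count-like features; they are not part of any released scikit-learn version):

import numpy as np

from sklearn.feature_selection import SelectKBest
# info_gain and info_gain_ratio are the functions added in this PR, not a released API
from sklearn.feature_selection import info_gain, info_gain_ratio

rng = np.random.RandomState(0)
X = rng.poisson(lam=1.0, size=(200, 30))  # non-negative count features, e.g. term counts
y = rng.randint(0, 2, size=200)

# keep the 10 features with the highest information gain (resp. gain ratio)
X_ig = SelectKBest(score_func=info_gain, k=10).fit_transform(X, y)
X_igr = SelectKBest(score_func=info_gain_ratio, k=10).fit_transform(X, y)
print(X_ig.shape, X_igr.shape)  # (200, 10) (200, 10)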


github-actions bot commented Apr 28, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b7d25ac. Link to the linter CI: here

StefanieSenger marked this pull request as ready for review April 29, 2024 09:23
Contributor

@OmarManzoor left a comment


Thanks for the PR @StefanieSenger. Would it make sense to add a test that compares the transformed X between information gain and information gain ratio, since they should generally be the same?
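For reference, one way such a test might look (a sketch only; the test name and data are illustrative, and it assumes the info_gain / info_gain_ratio functions from this PR, which, like chi2, expect non-negative features):

import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import info_gain, info_gain_ratio  # added in this PR


def test_info_gain_and_ratio_agree_on_informative_feature():
    rng = np.random.RandomState(0)
    y = rng.randint(0, 2, size=200)
    # feature 0 is a copy of the target, the remaining features are noise
    X = np.hstack([y.reshape(-1, 1), rng.poisson(lam=1.0, size=(200, 9))])
    X_ig = SelectKBest(info_gain, k=1).fit_transform(X, y)
    X_igr = SelectKBest(info_gain_ratio, k=1).fit_transform(X, y)
    # both criteria should keep the perfectly informative column
    np.testing.assert_array_equal(X_ig, X_igr)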

@StefanieSenger
Contributor Author

I have added such a test, @OmarManzoor. Maybe it helps if one day someone works on the if ratio block in _info_gain(), which is in fact the only place where the two functions differ.

Contributor

@OmarManzoor left a comment


A few minor suggestions otherwise this looks good. Thanks @StefanieSenger

3 resolved review comments on sklearn/feature_selection/_univariate_selection.py (outdated)
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
@StefanieSenger
Contributor Author

Nice, thank you @OmarManzoor

glemaitre self-requested a review May 21, 2024 15:28
Member

@glemaitre left a comment


Just a couple of first comments: we should use scipy instead of our own implementation of the entropy and the KL divergence.

Comment on lines +92 to +93
* For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
:func:`f_classif`, :func:`mutual_info_classif`
Member


Suggested change
* For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
:func:`f_classif`, :func:`mutual_info_classif`
* For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
:func:`f_classif`, :func:`mutual_info_classif`.

You can also add a full stop on the line before.


- |Feature| :func:`~feature_selection.info_gain` and
:func:`~feature_selection.info_gain_ratio` can now be used for
univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
Member


Suggested change
univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
univariate feature selection.
:pr:`28905` by :user:`Viktor Pekar <vpekar>` and
:user:`Stefanie Senger <StefanieSenger>`.

Comment on lines +293 to +296
def _get_entropy(prob):
t = np.log2(prob)
t[~np.isfinite(t)] = 0
return np.multiply(-prob, t)
Member


Nowadays, I think this is implemented in scipy.stats.entropy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html

The base here is set to 2 (I have to check if it makes sense or not).
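For what it's worth, a quick check of the base-2 relation (a sketch: since the PR helper is element-wise, the closer scipy counterpart is scipy.special.entr, which computes -p * log(p) per element, whereas scipy.stats.entropy sums over the array):

import numpy as np
from scipy.special import entr

prob = np.array([0.0, 0.25, 0.5, 0.25])


def _get_entropy(prob):
    # helper as in the PR: element-wise -p * log2(p), with 0 * log(0) treated as 0
    with np.errstate(divide="ignore"):
        t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)


# entr uses the natural log, so dividing by log(2) converts it to base 2
np.testing.assert_allclose(_get_entropy(prob), entr(prob) / np.log(2))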

Comment on lines +301 to +305
def _a_log_a_div_b(a, b):
with np.errstate(invalid="ignore", divide="ignore"):
t = np.log2(a / b)
t[~np.isfinite(t)] = 0
return np.multiply(a, t)
Member


Supposedly this could be replaced by rel_entr from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.rel_entr.html#scipy.special.rel_entr

The difference is that we use log2 instead of the log base e in the scipy definition. I have to check.
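A small element-wise sanity check of that relation (sketch):

import numpy as np
from scipy.special import rel_entr

a = np.array([0.0, 0.1, 0.3, 0.6])
b = np.array([0.25, 0.25, 0.25, 0.25])


def _a_log_a_div_b(a, b):
    # helper as in the PR: element-wise a * log2(a / b), with 0 * log(...) treated as 0
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
    t[~np.isfinite(t)] = 0
    return np.multiply(a, t)


# rel_entr(a, b) = a * log(a / b) element-wise with the natural log,
# so dividing by log(2) recovers the log2-based helper
np.testing.assert_allclose(_a_log_a_div_b(a, b), rel_entr(a, b) / np.log(2))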

Member


So I assume that we could use the natural logarithm everywhere, because the scores would only differ by a constant multiplier, and since we only compare them against each other, it should not matter.
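A tiny illustration: switching the base only rescales all scores by 1 / log(2), so the ordering used for selection is unchanged (and in the ratio the constant cancels entirely):

import numpy as np

rng = np.random.RandomState(0)
scores_nat = rng.rand(10)              # hypothetical per-feature scores using the natural log
scores_base2 = scores_nat / np.log(2)  # the same scores expressed in base 2

# identical ordering, hence identical feature selection
assert np.array_equal(np.argsort(scores_nat), np.argsort(scores_base2))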

c_prob = c_count / c_count.sum()
fc_prob = fc_count / total

c_f = _a_log_a_div_b(fc_prob, c_prob * f_prob)
Member


To give an example regarding the base, here it would be equivalent to:

c_f = rel_entr(fc_prob, c_prob * f_prob) / np.log(2)

@@ -0,0 +1,115 @@
"""
Member


We will probably avoid adding a new example and should instead edit an existing one.

Comment on lines +337 to +345
"""Count feature, class, joint and total frequencies

Returns
-------
f_count : array, shape = (n_features,)
c_count : array, shape = (n_classes,)
fc_count : array, shape = (n_features, n_classes)
total: int
"""
Member


We will need a proper docstring following our current standards.
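For example, a sketch of what a numpydoc-style docstring could look like (shapes and descriptions are inferred from how the helper is used and may need adjusting):

def _get_fc_counts(X, y):
    """Count feature, class, joint and total frequencies.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Sample vectors with non-negative feature values (e.g. term counts).

    y : array-like of shape (n_samples,)
        Target vector (class labels).

    Returns
    -------
    f_count : ndarray of shape (n_features,)
        Per-feature counts summed over all samples.

    c_count : ndarray of shape (n_classes,)
        Per-class counts.

    fc_count : ndarray of shape (n_features, n_classes)
        Joint feature/class counts.

    total : int
        Total count over all samples and features.
    """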

return np.asarray(scores).reshape(-1)


def _get_fc_counts(X, y):
Member


Since this is called a single time, we should not need a separate function.

with np.errstate(invalid="ignore", divide="ignore"):
scores = scores / (_get_entropy(c_prob) + _get_entropy(1 - c_prob))

# the feature score is averaged over classes
Member


I think the comment only applies to the first case.

c_nf = _a_log_a_div_b((c_count - fc_count) / total, c_prob * (1 - f_prob))
nc_f = _a_log_a_div_b((f_count - fc_count) / total, (1 - c_prob) * f_prob)

scores = c_f + nc_nf + c_nf + nc_f
Member


I think I would prefer _info_gain to return this score, have the ratio below computed in info_gain_ratio, and finally have the reduction done in a way that both functions can share:

def _info_gain(X, y):
    # probably the name of the function should be better.
    ...
    return scores, c_prob

def info_gain(X, y, aggregate=np.max):
    # np.max is called directly; a ufunc like np.maximum.reduce would also work
    return aggregate(_info_gain(X, y)[0], axis=0)

def info_gain_ratio(X, y, aggregate=np.max):
    scores, c_prob = _info_gain(X, y)
    with np.errstate(invalid="ignore", divide="ignore"):
        # `entropy` stands for the element-wise entropy term (cf. `_get_entropy`)
        scores /= (entropy(c_prob) + entropy(1 - c_prob))
    return aggregate(scores, axis=0)
