FEA Add Information Gain and Information Gain Ratio feature selection functions #28905
Conversation
Thanks for the PR @StefanieSenger. Would it make sense to add a test which compares the transformed X between the information gain and information gain ratio, since they should be generally the same?
I have added such a test @OmarManzoor, maybe it helps if one day someone works on the
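For illustration only (this is not the PR's actual test, and the imports assume the new functions land in sklearn.feature_selection as proposed), a test along the lines suggested above might look like:

import numpy as np
from numpy.testing import assert_array_equal
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio


def test_info_gain_and_info_gain_ratio_select_same_features():
    # both criteria should typically rank features the same way, so the
    # transformed X from SelectKBest should match on a simple dataset
    X, y = make_classification(n_samples=100, n_features=10, random_state=0)
    X = np.abs(X)  # keep feature values non-negative, as for chi2-style scores
    X_ig = SelectKBest(info_gain, k=3).fit_transform(X, y)
    X_igr = SelectKBest(info_gain_ratio, k=3).fit_transform(X, y)
    assert_array_equal(X_ig, X_igr)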
A few minor suggestions, otherwise this looks good. Thanks @StefanieSenger.
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Nice, thank you @OmarManzoor
Just a couple of initial comments: use scipy instead of our own implementation of the entropy or the KL divergence.
* For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
  :func:`f_classif`, :func:`mutual_info_classif`
Suggested change:
- * For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
-   :func:`f_classif`, :func:`mutual_info_classif`
+ * For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
+   :func:`f_classif`, :func:`mutual_info_classif`.
You can also add a full stop on the line before.
- |Feature| :func:`~feature_selection.info_gain` and
  :func:`~feature_selection.info_gain_ratio` can now be used for
  univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
Suggested change:
- univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
+ univariate feature selection.
+ :pr:`28905` by :user:`Viktor Pekar <vpekar>` and
+ :user:`Stefanie Senger <StefanieSenger>`.
def _get_entropy(prob):
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)
Nowadays, I think this is implemented in scipy.stats.entropy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html. The base here is set to 2 (I have to check if it makes sense or not).
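As a quick sanity check of that suggestion (just a sketch, reusing the helper from the diff above): scipy.stats.entropy sums the elementwise -p * log(p) terms, so the comparison is against the sum of _get_entropy's output.

import numpy as np
from scipy.stats import entropy


def _get_entropy(prob):  # helper as written in the diff above
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)


prob = np.array([0.5, 0.25, 0.25])
# elementwise terms summed vs. scipy's entropy with base=2
assert np.isclose(_get_entropy(prob).sum(), entropy(prob, base=2))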
def _a_log_a_div_b(a, b):
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
    t[~np.isfinite(t)] = 0
    return np.multiply(a, t)
Supposedly this could be replaced by rel_entr from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.rel_entr.html#scipy.special.rel_entr. The difference is that we use log2 instead of the log base e in the scipy definition. I have to check.
So I assume that we could use the natural logarithm everywhere, because it would only differ by a constant multiplier, and since we are only comparing the information scores it should not matter.
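A tiny illustration of that argument (made-up scores, not from the PR): rescaling all scores by the constant factor between log bases does not change how features are ranked, so any top-k selection is unaffected.

import numpy as np

scores_bits = np.array([0.12, 0.45, 0.03, 0.30])  # hypothetical scores in bits
scores_nats = scores_bits * np.log(2)             # the same scores in nats

# the ordering of features, and hence any top-k selection, is identical
assert np.array_equal(np.argsort(scores_bits), np.argsort(scores_nats))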
c_prob = c_count / c_count.sum()
fc_prob = fc_count / total

c_f = _a_log_a_div_b(fc_prob, c_prob * f_prob)
To give an example regarding the base, here it would be equivalent to:
c_f = rel_entr(fc_prob, c_prob * f_prob) / np.log(2)
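A minimal numerical check of this equivalence (toy probabilities standing in for the PR's arrays): rel_entr(a, b) computes a * log(a / b) with the natural log, so dividing by np.log(2) recovers the base-2 version used in _a_log_a_div_b.

import numpy as np
from scipy.special import rel_entr

fc_prob = np.array([0.2, 0.1, 0.3])  # toy joint probabilities
c_prob = np.array([0.5, 0.2, 0.6])   # toy class probabilities
f_prob = np.array([0.5, 0.4, 0.7])   # toy feature probabilities

manual = fc_prob * np.log2(fc_prob / (c_prob * f_prob))
via_scipy = rel_entr(fc_prob, c_prob * f_prob) / np.log(2)
assert np.allclose(manual, via_scipy)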
@@ -0,0 +1,115 @@
"""
We will probably avoid adding a new example and instead edit an existing one.
"""Count feature, class, joint and total frequencies | ||
|
||
Returns | ||
------- | ||
f_count : array, shape = (n_features,) | ||
c_count : array, shape = (n_classes,) | ||
fc_count : array, shape = (n_features, n_classes) | ||
total: int | ||
""" |
We will need a proper docstring following our new standards.
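Purely as an illustration of the shape such a docstring could take (the descriptions below are guesses from the diff, not the authors' wording):

def _get_fc_counts(X, y):
    """Count feature, class, joint feature-class and total frequencies.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Sample vectors.

    y : array-like of shape (n_samples,)
        Target vector (class labels).

    Returns
    -------
    f_count : ndarray of shape (n_features,)
        Per-feature occurrence counts.

    c_count : ndarray of shape (n_classes,)
        Per-class sample counts.

    fc_count : ndarray of shape (n_features, n_classes)
        Joint feature-class occurrence counts.

    total : int
        Total number of samples.
    """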
    return np.asarray(scores).reshape(-1)


def _get_fc_counts(X, y):
Since this is called a single time, we should not need to have a function.
with np.errstate(invalid="ignore", divide="ignore"):
    scores = scores / (_get_entropy(c_prob) + _get_entropy(1 - c_prob))

# the feature score is averaged over classes
I think the comment only applies to the first case.
c_nf = _a_log_a_div_b((c_count - fc_count) / total, c_prob * (1 - f_prob))
nc_f = _a_log_a_div_b((f_count - fc_count) / total, (1 - c_prob) * f_prob)

scores = c_f + nc_nf + c_nf + nc_f
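As a sketch of why there are four terms (not from the PR): they are the decomposition of the mutual information between a binary feature-presence indicator and a one-vs-rest class indicator, which can be checked numerically against sklearn.metrics.mutual_info_score converted to bits.

import numpy as np
from scipy.special import rel_entr
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1000)    # binary feature indicator (toy data)
y_c = rng.integers(0, 2, size=1000)  # one-vs-rest class indicator (toy data)

# 2x2 joint distribution and its marginals
joint = np.array([[np.mean((x == i) & (y_c == j)) for j in (0, 1)] for i in (0, 1)])
p_x = joint.sum(axis=1, keepdims=True)
p_y = joint.sum(axis=0, keepdims=True)

four_terms = rel_entr(joint, p_x * p_y).sum() / np.log(2)  # sum of the four terms, in bits
reference = mutual_info_score(x, y_c) / np.log(2)          # nats -> bits
assert np.isclose(four_terms, reference)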
I think that I would prefer `_info_gain` to return this score, have the ratio below done in `info_gain_ratio`, and finally have a function that could be called twice to just make the reduction.
def _info_gain(X, y):
    # probably the name of the function should be better
    ...
    return scores, c_prob


def info_gain(X, y, aggregate=np.max):
    return aggregate(_info_gain(X, y)[0], axis=0)


def info_gain_ratio(X, y, aggregate=np.max):
    # entropy here is scipy.stats.entropy, as discussed above
    scores, c_prob = _info_gain(X, y)
    with np.errstate(invalid="ignore", divide="ignore"):
        scores /= entropy(c_prob) + entropy(1 - c_prob)
    return aggregate(scores, axis=0)
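As a usage sketch (assuming info_gain and info_gain_ratio end up exposed from sklearn.feature_selection as this PR proposes), either score function would plug into the existing univariate selectors:

import numpy as np
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio

# toy non-negative count data, as one would have for text features
X = np.array([[1, 0, 3],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 2]])
y = np.array([0, 1, 0, 1])

X_ig = SelectKBest(score_func=info_gain, k=2).fit_transform(X, y)
X_igr = SelectKBest(score_func=info_gain_ratio, k=2).fit_transform(X, y)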
Reference Issues/PRs
closes #6534
What does this implement/fix? Explain your changes.
This 2016 PR intended to add `info_gain` and `info_gain_ratio` functions for univariate feature selection. Here, I update and finish it up. For further information, please refer to the discussion on the old PR.