Skip to content

Latest commit

 

History

History
75 lines (56 loc) · 2.95 KB

eda.md

File metadata and controls

75 lines (56 loc) · 2.95 KB

Exploratory data analysis

Feature importance

The feature_importance method returns two dataframes that contain feature importance metrics that depend on the types of the feature/target

Feature/Task Classification Regression
Categorical Chi²-test + Cramér's V F-test
Numerical F-test Pearson correlation

Additionally mutual information can be used in each case.

Classification.

>>> import pandas as pd
>>> from sklearn import datasets
>>> import xam

>>> iris = datasets.load_iris()
>>> features = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> features['sepal length (cm)'] = features['sepal length (cm)'] > 5.5
>>> target = pd.Series(iris.target)

>>> cont_imp, disc_imp = xam.eda.feature_importance_classification(features, target, random_state=1)

>>> cont_imp.sort_values('f_p_value')  # doctest: +SKIP
                   f_statistic     f_p_value  mutual_information
petal length (cm)  1179.034328  3.051976e-91            0.990061
petal width (cm)    959.324406  4.376957e-85            0.977279
sepal width (cm)     47.364461  1.327917e-16            0.256295

>>> disc_imp.sort_values('chi2_p_value')  # doctest: +SKIP
                   chi2_statistic  chi2_p_value  cramers_v  mutual_information
sepal length (cm)        98.11883  4.940452e-22   0.803139            0.386244

Regression.

>>> import pandas as pd
>>> from sklearn import datasets
>>> import xam

>>> boston = datasets.load_boston()
>>> features = pd.DataFrame(boston.data, columns=boston.feature_names)
>>> features['CHAS'] = features['CHAS'].astype(int)
>>> target = pd.Series(boston.target)

>>> cont_imp, disc_imp = xam.eda.feature_importance_regression(features, target, random_state=1)

>>> cont_imp.sort_values('pearson_r_p_value')
         pearson_r  pearson_r_p_value  mutual_information
LSTAT    -0.737663       5.081103e-88            0.666882
RM        0.695360       2.487229e-74            0.526456
PTRATIO  -0.507787       1.609509e-34            0.453291
INDUS    -0.483725       4.900260e-31            0.471507
TAX      -0.468536       5.637734e-29            0.363694
NOX      -0.427321       7.065042e-24            0.456947
CRIM     -0.388305       1.173987e-19            0.339019
RAD      -0.381626       5.465933e-19            0.217623
AGE      -0.376955       1.569982e-18            0.311285
ZN        0.360445       5.713584e-17            0.195153
B         0.333461       1.318113e-14            0.161861
DIS       0.249929       1.206612e-08            0.295207

>>> disc_imp.sort_values('mutual_information')
      f_statistic  f_p_value  mutual_information
CHAS    15.971512   0.000074            0.030825