
nan score for StackingClassifier due to 'scoring' argument in cross_val_score #1059

Open
kemaldahha opened this issue Aug 1, 2023 · 3 comments

@kemaldahha

Hi, I'm trying to run the code below (Example 1 from the StackingClassifier documentation):

from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
import numpy as np
import warnings

warnings.simplefilter('ignore')

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 
                          meta_classifier=lr)

print('3-fold cross validation:\n')

for clf, label in zip([clf1, clf2, clf3, sclf], 
                      ['KNN', 
                       'Random Forest', 
                       'Naive Bayes',
                       'StackingClassifier']):

    scores = model_selection.cross_val_score(clf, X, y, 
                                              cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))

I get the following output:

3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: nan (+/- nan) [StackingClassifier]

The expected output is a numeric score for the StackingClassifier, like:

3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.95 (+/- 0.02) [StackingClassifier]

When I let the warnings print by commenting out warnings.simplefilter('ignore'), I get the output below (truncated, since the warning is repeated several times):

3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
c:\projects\machine-learning-matt-harrison\env\lib\site-packages\sklearn\model_selection\_validation.py:842: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "c:\projects\machine-learning-matt-harrison\env\lib\site-packages\sklearn\metrics\_scorer.py", line 136, in __call__
    score = scorer._score(
  File "c:\projects\machine-learning-matt-harrison\env\lib\site-packages\sklearn\metrics\_scorer.py", line 353, in _score
    y_pred = method_caller(estimator, "predict", X)
  File "c:\projects\machine-learning-matt-harrison\env\lib\site-packages\sklearn\metrics\_scorer.py", line 86, in _cached_call
    result, _ = _get_response_values(
  File "c:\projects\machine-learning-matt-harrison\env\lib\site-packages\sklearn\utils\_response.py", line 74, in _get_response_values
    classes = estimator.classes_
AttributeError: 'StackingClassifier' object has no attribute 'classes_'

The problem seems to be related to the scoring argument in model_selection.cross_val_score(clf, X, y, cv=3, scoring='accuracy'). If I remove that argument, the default scoring is used (the estimator's own score method, which for classifiers is accuracy), and I get the expected output, matching the example in the documentation:

3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.95 (+/- 0.02) [StackingClassifier]

However, I would like to use other scoring metrics as well (e.g. roc_auc), which means I have to pass the scoring argument explicitly, and then I get the nan score for the StackingClassifier again.
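
To make the difference concrete, here is a minimal sketch of the two call paths, reusing sclf, X, and y from the snippet above. The plain-callable scorer at the end is only a possible way to sidestep the issue that I am speculating about, not something from the mlxtend or sklearn docs:

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Works: without a `scoring` argument, cross_val_score falls back to
# estimator.score(), which for classifiers is plain accuracy and never
# touches estimator.classes_.
scores_default = cross_val_score(sclf, X, y, cv=3)

# Gives nan with sklearn 1.3.0 + mlxtend 0.22.0: the string scorer goes
# through sklearn's _get_response_values, which reads estimator.classes_
# (the attribute the traceback above complains about).
scores_named = cross_val_score(sclf, X, y, cv=3, scoring='accuracy')

# Possible sidestep: a plain callable with the (estimator, X, y) signature
# is used directly as the scorer and bypasses that machinery.
def accuracy_scorer(estimator, X_val, y_val):
    return accuracy_score(y_val, estimator.predict(X_val))

scores_callable = cross_val_score(sclf, X, y, cv=3, scoring=accuracy_scorer)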

I already checked issues #423 and #426, which mention a similar warning/error (AttributeError: 'StackingClassifier' object has no attribute 'classes_'), but I couldn't figure it out based on those issues.

I am using:

  • Python 3.10.0
  • scikit-learn==1.3.0
  • mlxtend==0.22.0
@rasbt (Owner) commented Aug 1, 2023

Thanks for the note! I can confirm I'm seeing this issue with sklearn 1.3.0 as well (but not with 1.2.2). I just submitted PR #1060 to fix it.
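
In the meantime, for anyone stuck on sklearn 1.3.0 before a fixed mlxtend release is out, one possible stopgap (just a sketch, not necessarily what #1060 does, and not exhaustively tested) is to expose the missing attribute via a small subclass:

import numpy as np
from mlxtend.classifier import StackingClassifier

class PatchedStackingClassifier(StackingClassifier):
    # Stopgap only: expose the classes_ attribute that sklearn 1.3's
    # scorer machinery expects before delegating to the normal fit.
    def fit(self, X, y, **fit_params):
        self.classes_ = np.unique(y)
        return super().fit(X, y, **fit_params)

# Drop-in replacement in the original example:
# sclf = PatchedStackingClassifier(classifiers=[clf1, clf2, clf3],
#                                  meta_classifier=lr)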

@kemaldahha (Author)

I came across this lecture by @rasbt. Based on his explanation, a StackingClassifier has also been included in sklearn. I adjusted the code to use sklearn's StackingClassifier:

from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
# from mlxtend.classifier import StackingClassifier
import numpy as np
import warnings

warnings.simplefilter('ignore')

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

estimators = [("clf1", clf1),
              ("clf2", clf2),
              ("clf3", clf3)]

lr = LogisticRegression()

sclf = StackingClassifier(estimators=estimators, 
                          final_estimator=lr)

print('3-fold cross validation:\n')

for clf, label in zip([clf1, clf2, clf3, sclf], 
                      ['KNN', 
                       'Random Forest', 
                       'Naive Bayes',
                       'StackingClassifier']):

    scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring="accuracy")
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))

Now I do get output more in line with what I expect, though not exactly the same as in the mlxtend StackingClassifier documentation (Example 1):

3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.93 (+/- 0.02) [StackingClassifier]

Perhaps sklearn's StackingClassifier implementation differs from mlxtend's; as far as I can tell, sklearn's version trains the meta-classifier on cross-validated (out-of-fold) predictions, which would explain the slightly different score.

I am wondering whether we should still use mlxtend's StackingClassifier, or whether it is deprecated in favor of sklearn's implementation?
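
For reference, mlxtend also provides StackingCVClassifier, which (as far as I understand) is the closer analogue to sklearn's StackingClassifier: both generate the meta-features from out-of-fold predictions, whereas mlxtend's plain StackingClassifier fits the meta-classifier on predictions made on the full training set. A rough sketch of the equivalent setup, reusing clf1, clf2, clf3, and lr from above (parameter values are just for illustration):

from mlxtend.classifier import StackingCVClassifier

# Meta-features for the meta-classifier come from internal cross-validation
# (out-of-fold predictions), similar to sklearn.ensemble.StackingClassifier.
sclf_cv = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                               meta_classifier=lr,
                               cv=3,
                               random_state=1)

# Note: on scikit-learn 1.3.0 this presumably hits the same classes_ issue
# until the fix from #1060 is released.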

@kemaldahha (Author)

> Thanks for the note! I can confirm I'm seeing this issue with sklearn 1.3.0 as well (but not with 1.2.2). I just submitted PR #1060 to fix it.

Thanks for the reply. I posted my second comment before I read your reply, apologies.
