[R-package] lgb.cv() fails with categorical features #6412

MathiasAmbuehl · 2024-04-11T12:51:26Z

Description

Executing a cross-validation with lgb.cv() fails when the data contains categorical features.

Reproducible example

I'm using the example in the code demo categorical_features_rules:

demo("categorical_features_rules", "lightgbm")  # executes the example

In that example, an lgb.Booster is trained with

model <- lgb.train(
    params = params
    , data = dtrain
    , nrounds = 100L
    , valids = list(train = dtrain, valid = dtest)
)

after the training and test data have been created as

dtrain <- lgb.Dataset(
    data = my_data_train
    , label = bank_train$y
    , categorical_feature = c(2L, 3L, 4L, 5L, 7L, 8L, 9L, 11L, 16L)
)
dtest <- lgb.Dataset.create.valid(
    dtrain
    , data = my_data_test
    , label = bank_test$y
)

This works perfectly so far. Now, if I try to cross-validate on the training data set with

lgb.cv(
    params = params
    , data = dtrain
    , nrounds = 100L
)

I get this error:

Error in if (data_is_not_filename && max(private$categorical_feature) >  : 
  missing value where TRUE/FALSE needed

Specifying the categorical_feature argument again in lgb.cv() does not help.

The error seems to be related to the categorical features. When I run a similar analysis without them, lgb.cv() works fine:

dtrain_nocat <- lgb.Dataset(
    data = my_data_train
    , label = bank_train$y
    # no categorical_feature here!
)
lgb.cv(
    params = params
    , data = dtrain_nocat
    , nrounds = 100L
)

Environment info

> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Switzerland.utf8  LC_CTYPE=German_Switzerland.utf8    LC_MONETARY=German_Switzerland.utf8
[4] LC_NUMERIC=C                        LC_TIME=German_Switzerland.utf8    

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.8 lightgbm_4.3.0   

loaded via a namespace (and not attached):
[1] compiler_4.3.1    R6_2.5.1          Matrix_1.6-4      parallel_4.3.1    tools_4.3.1       rstudioapi_0.15.0
[7] grid_4.3.1        jsonlite_1.8.5    lattice_0.21-8

jmoralez · 2024-04-16T17:31:25Z

Seems like I introduced this in #5184. I don't remember why I changed the condition from length(private$colnames) to ncol(private$raw_data). I'll investigate further.
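
For readers following along, here is a minimal base-R sketch of how a check of this shape can end in the reported error. It assumes, purely for illustration, that the raw data is NULL when the condition is evaluated; the thread does not confirm that this is what happens inside lgb.cv(). The variable names mirror the ones in the error message and the comment above, but nothing here calls lightgbm.

```r
# Assumption for illustration only: the raw data is NULL when the check runs.
raw_data <- NULL
categorical_feature <- c(2L, 3L, 4L, 5L, 7L, 8L, 9L, 11L, 16L)
data_is_not_filename <- TRUE

n_features <- ncol(raw_data)                      # ncol(NULL) is NULL
exceeds <- max(categorical_feature) > n_features  # comparing with NULL gives logical(0)
condition <- data_is_not_filename && exceeds      # TRUE && logical(0) gives NA

# if (NA) is an error, which is the failure mode reported in this issue.
msg <- tryCatch(
    {
        if (condition) stop("categorical feature index out of range")
        "no error"
    }
    , error = function(e) conditionMessage(e)
)
print(msg)
```

In an English locale the printed message should be the same "missing value where TRUE/FALSE needed" text as in the report. A length-based count such as length(colnames) never returns NULL, so a comparison against it cannot collapse to NA in this way.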

pimlabee · 2024-04-22T12:54:07Z

I ran into the same issue. The error does not occur (i.e. cross-validation seems to work) when categorical_feature is given as a character vector of feature names instead of integer indices,

e.g. cat_features = c("a", "b", "c").

I had no issues when I passed this vector both to lgb.Dataset() when creating the training set AND to lgb.cv(categorical_feature = cat_features).

Hope this helps.

MathiasAmbuehl · 2024-04-22T13:35:25Z

Yes, that's a useful workaround that circumvents the bug in my example, too!
Thanks for the hint.
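
The workaround described above can be sketched as follows. This is an illustration under stated assumptions, not a confirmed fix: toy_data, toy_label, and the column names are hypothetical stand-ins for the demo's my_data_train and bank_train$y, and the sketch sets the names only on the lgb.Dataset, whereas the commenter also passed them to lgb.cv(). Whether the dataset alone is sufficient on lightgbm 4.3.0 was not established in the thread.

```r
# Hypothetical toy data standing in for my_data_train; column names are made up.
set.seed(42)
n <- 200L
toy_data <- cbind(
    amount = rnorm(n)
    , job = sample(0:3, n, replace = TRUE)
    , region = sample(0:2, n, replace = TRUE)
)
toy_label <- as.integer(toy_data[, "amount"] + toy_data[, "job"] > 1.5)

# Translate the 1-based integer positions into column names.
cat_idx <- c(2L, 3L)
cat_features <- colnames(toy_data)[cat_idx]
print(cat_features)

# Only runs when lightgbm is installed.
if (requireNamespace("lightgbm", quietly = TRUE)) {
    dtrain_named <- lightgbm::lgb.Dataset(
        data = toy_data
        , label = toy_label
        , categorical_feature = cat_features  # names instead of integer indices
    )
    cv_result <- lightgbm::lgb.cv(
        params = list(objective = "binary", verbose = -1L)
        , data = dtrain_named
        , nrounds = 10L
        , nfold = 3L
    )
}
```

Deriving the names from colnames() keeps the original index vector as the single source of truth, so the two specifications cannot drift apart.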

jameslamb added bug r-package labels Apr 11, 2024

jameslamb assigned jmoralez Apr 16, 2024

jmoralez linked a pull request May 4, 2024 that will close this issue

[R-package] fix integer categorical features check in lgb.cv (fixes #6412) #6442

Open
