TimeSeriesSVR input causes nan errors #440

scott-c-jensen · 2023-01-24T23:07:21Z

Describe the bug
Fitting fails on some inputs with a "contains NaN" error even though the data contain no NaNs. The minimal example below raises the same error as my larger dataset, although the root cause may or may not be the same. Note that my larger dataset is trimmed to avoid the 405-datapoint limit reported elsewhere.

To Reproduce
The following minimal example raises the error (full traceback under "Additional context"):

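import numpy as np
from tslearn.svm import TimeSeriesSVR
from tslearn.utils import to_time_series_dataset

# Three constant, equal-length series; none of them contains NaN.
X = to_time_series_dataset([np.ones(3), np.ones(3) * 2, np.ones(3)])
y = [0, 3, 0.1]

# Fitting with the "gak" kernel raises "Input X contains NaN".
clf = TimeSeriesSVR(C=1., kernel="gak")
clf.fit(X, y)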
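
To back the claim that the input itself is NaN-free, a quick check can be run before fitting. This is a minimal sketch in plain NumPy; `X_check` is a hypothetical stand-in that mirrors the three equal-length series above in an assumed (n_series, length, 1) layout, not the actual output of `to_time_series_dataset`.

```python
import numpy as np

# Hypothetical stand-in for the dataset above: three univariate series of
# length 3, stacked into an assumed (n_series, length, n_features) layout.
X_check = np.stack([np.ones(3), np.ones(3) * 2, np.ones(3)])[:, :, np.newaxis]

# True if any entry of the input is NaN.
has_nan = bool(np.isnan(X_check).any())
print(X_check.shape, has_nan)
```

If a check like this reports no NaN, the NaN named in the error was presumably introduced after the data was handed to `fit`.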

Expected behavior
The model should fit without failing, or at least raise a clearer error message.

Environment (please complete the following information):

Windows 11
tslearn version [0.5.3.2]
numpy [1.23.1]
python [3.10.5]

Additional context
Here is the full traceback:

5 y = [0, 3, 0.1]
6 clf = TimeSeriesSVR(C=1., kernel="gak")
----> 7 clf.fit(X, y)
File ~\PyVenvs\test2\lib\site-packages\tslearn\svm\svm.py:552, in TimeSeriesSVR.fit(self, X, y, sample_weight)
544 sklearn_X, y = self.preprocess_sklearn(X, y, fit_time=True)
546 self.svm_estimator_ = SVR(
547 C=self.C, kernel=self.estimator_kernel_, degree=self.degree,
548 gamma=self.gamma_, coef0=self.coef0, shrinking=self.shrinking,
549 tol=self.tol, cache_size=self.cache_size,
550 verbose=self.verbose, max_iter=self.max_iter
551 )
--> 552 self.svm_estimator_.fit(sklearn_X, y, sample_weight=sample_weight)
553 return self

File ~\PyVenvs\test2\lib\site-packages\sklearn\svm\_base.py:192, in BaseLibSVM.fit(self, X, y, sample_weight)
190 check_consistent_length(X, y)
191 else:
--> 192 X, y = self._validate_data(
193 X,
194 y,
195 dtype=np.float64,
196 order="C",
197 accept_sparse="csr",
198 accept_large_sparse=False,
199 )
201 y = self._validate_targets(y)
203 sample_weight = np.asarray(
204 [] if sample_weight is None else sample_weight, dtype=np.float64
205 )

File ~\PyVenvs\test2\lib\site-packages\sklearn\base.py:554, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
552 y = check_array(y, input_name="y", **check_y_params)
553 else:
--> 554 X, y = check_X_y(X, y, **check_params)
555 out = X, y
557 if not no_val_X and check_params.get("ensure_2d", True):

File ~\PyVenvs\test2\lib\site-packages\sklearn\utils\validation.py:1104, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1099 estimator_name = _check_estimator_name(estimator)
1100 raise ValueError(
1101 f"{estimator_name} requires y to be passed, but the target y is None"
1102 )
-> 1104 X = check_array(
1105 X,
1106 accept_sparse=accept_sparse,
1107 accept_large_sparse=accept_large_sparse,
1108 dtype=dtype,
1109 order=order,
1110 copy=copy,
1111 force_all_finite=force_all_finite,
1112 ensure_2d=ensure_2d,
1113 allow_nd=allow_nd,
1114 ensure_min_samples=ensure_min_samples,
1115 ensure_min_features=ensure_min_features,
1116 estimator=estimator,
1117 input_name="X",
1118 )
1120 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1122 check_consistent_length(X, y)

File ~\PyVenvs\test2\lib\site-packages\sklearn\utils\validation.py:919, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
913 raise ValueError(
914 "Found array with dim %d. %s expected <= 2."
915 % (array.ndim, estimator_name)
916 )
918 if force_all_finite:
--> 919 _assert_all_finite(
920 array,
921 input_name=input_name,
922 estimator_name=estimator_name,
923 allow_nan=force_all_finite == "allow-nan",
924 )
926 if ensure_min_samples > 0:
927 n_samples = _num_samples(array)

File ~\PyVenvs\test2\lib\site-packages\sklearn\utils\validation.py:161, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
144 if estimator_name and input_name == "X" and has_nan_error:
145 # Improve the error message on how to handle missing values in
146 # scikit-learn.
147 msg_err += (
148 f"\n{estimator_name} does not accept missing values"
149 " encoded as NaN natively. For supervised learning, you might want"
(...)
159 "#estimators-that-handle-nan-values"
160 )
--> 161 raise ValueError(msg_err)

ValueError: Input X contains NaN.
SVR does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

lomnes-atlast-food · 2023-03-24T12:57:31Z

I ran into this issue as well; thanks for reporting it. I found that the NaN values are generated inside the 'gak' kernel, but the error message says the unsupported values are in 'Input X', which is confusing to users. Two fixes are needed: one for 'gak' itself and one for the misleading error message.
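
One way to picture a clearer error message is a guard that inspects the kernel (Gram) matrix before it is handed to the underlying estimator. The sketch below is only an illustration: `check_gram_finite` is a hypothetical helper, not part of tslearn or scikit-learn, and the wording of the message is an assumption.

```python
import numpy as np

def check_gram_finite(gram, kernel_name="gak"):
    """Hypothetical guard: fail early if the kernel matrix has NaN or inf."""
    gram = np.asarray(gram, dtype=float)
    bad = ~np.isfinite(gram)
    if bad.any():
        raise ValueError(
            f"Kernel '{kernel_name}' produced {int(bad.sum())} non-finite "
            "entries in the Gram matrix; the input time series may be fine, "
            "so check the kernel bandwidth."
        )
    return gram

# A well-formed Gram matrix passes through unchanged.
ok = check_gram_finite(np.eye(3))

# A Gram matrix containing NaN is rejected with a message naming the kernel.
try:
    check_gram_finite(np.array([[1.0, np.nan], [np.nan, 1.0]]))
except ValueError as exc:
    message = str(exc)
    print(message)
```

A message of this kind would point at the kernel computation instead of at 'Input X'.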

My example used:

from tslearn.svm import TimeSeriesSVC
from tslearn.utils import to_time_series_dataset
import numpy as np
X = to_time_series_dataset([ np.ones(3), np.ones(3)*2, np.ones(3)])
y = [0, 1, 0]
clf = TimeSeriesSVC(kernel="gak")
clf.fit(X, y)

The first warning in the trace is:

venv/lib/python3.9/site-packages/tslearn/metrics/softdtw_variants.py:44: RuntimeWarning: invalid value encountered in divide
  gram = -cdist(s1, s2, "sqeuclidean") / (2 * sigma**2)
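
The warning points at a division by `2 * sigma**2`. One possible explanation, which is an assumption and not confirmed in this thread, is that the bandwidth `sigma` evaluates to 0 for this data, so that zero distances divided by zero become NaN. The sketch below reproduces only that arithmetic with plain NumPy; `scaled_sq_dists` is a hypothetical helper that mimics the expression in the warning and is not tslearn code.

```python
import numpy as np

def scaled_sq_dists(s1, s2, sigma):
    """Hypothetical helper: -(squared Euclidean distances) / (2 * sigma**2)."""
    diff = s1[:, np.newaxis, :] - s2[np.newaxis, :, :]
    sq = (diff ** 2).sum(axis=-1)
    # Silence the floating-point warnings so the result can be inspected.
    with np.errstate(divide="ignore", invalid="ignore"):
        return -sq / (2 * sigma ** 2)

# Two identical constant series of length 3 with one feature each.
s1 = np.ones((3, 1))
s2 = np.ones((3, 1))

gram_zero = scaled_sq_dists(s1, s2, sigma=0.0)  # 0 / 0 in every entry
gram_ok = scaled_sq_dists(s1, s2, sigma=1.0)    # 0 / 2 in every entry

print(np.isnan(gram_zero).all())
print(np.isfinite(gram_ok).all())
```

Under that assumption, any strictly positive bandwidth keeps the matrix finite, which would be consistent with the failure being specific to the 'gak' kernel.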

The last error in the trace is:

ValueError: Input X contains NaN.
SVC does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Changing the kernel to "rbf" results in no error:

from tslearn.svm import TimeSeriesSVC
from tslearn.utils import to_time_series_dataset
import numpy as np
X = to_time_series_dataset([ np.ones(3), np.ones(3)*2, np.ones(3)])
y = [0, 1, 0]
clf = TimeSeriesSVC(kernel="rbf")
clf.fit(X, y)

scott-c-jensen added the bug label Jan 24, 2023
