model training got stuck when running the official tutorial example #937

hxuaj · 2024-03-21T10:00:19Z

What happened + What you expected to happen

Hi,
I'm new to Nixtla. When I tried to run the example code from the official tutorial on my local machine (Linux, CentOS): https://nixtlaverse.nixtla.io/neuralforecast/examples/getting_started_complete.html, it got stuck at the nf.fit(df=Y_df) step:

2024-03-21 17:00:29,350	INFO worker.py:1724 -- Started a local Ray instance.
2024-03-21 17:00:29,926	INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2024-03-21 17:00:29,927	INFO tune.py:592 -- [output] This will use the new output engine with verbosity 0. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment     _train_tune_2024-03-21_17-00-27   │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm                 BasicVariantGenerator             │
│ Scheduler                        FIFOScheduler                     │
│ Number of trials                 5                                 │
╰────────────────────────────────────────────────────────────────────╯

View detailed results here: /root/ray_results/_train_tune_2024-03-21_17-00-27
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/_train_tune_2024-03-21_17-00-27`
(_train_tune pid=2885517) Seed set to 11
(_train_tune pid=2885517) [rank: 0] Seed set to 11
(_train_tune pid=2885517) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2

The steps I followed to set up:

1. Used conda to create a new environment with Python 3.9.
2. Ran pip install statsforecast s3fs datasetsforecast, as in the tutorial example.
3. Ran pip install git+https://github.com/Nixtla/neuralforecast.git@main, as in the tutorial example.
4. Ran pip install matplotlib to get the 3rd step of the tutorial to work.
5. Changed freq='H' to freq=1 in nf = NeuralForecast( models=[ AutoNHITS(h=48, config=config_nhits, loss=MQLoss(), num_samples=5), AutoLSTM(h=48, config=config_lstm, loss=MQLoss(), num_samples=2), ], freq='H' ), because freq='H' raised ValueError: Time column contains integers but the specified frequency is not an integer. Please provide a valid integer, e.g. 'freq=1'

I was wondering what could have gone wrong in the steps above and why it got stuck during training.

Then I tried the tutorial notebook in Colab. There the fit completes, but the evaluation step evaluation_df = accuracy(cv_df, [mse, mae, rmse], agg_by=['unique_id']) raises an error:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/triad/collections/schema.py in append(self, obj)
    359             elif isinstance(obj, pd.DataFrame):
--> 360                 self._append_pa_schema(PD_UTILS.to_schema(obj))
    361             elif isinstance(obj, Tuple):  # type: ignore

11 frames
ValueError: pandas like datafame index can't have name

During handling of the above exception, another exception occurred:

SchemaError                               Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/triad/collections/schema.py in append(self, obj)
    370             raise
    371         except Exception as e:
--> 372             raise SchemaError(str(e))
    373 
    374     def remove(  # noqa: C901

SchemaError: pandas like datafame index can't have name

Looking forward to your reply.

Versions / Dependencies

OS: Linux CentOS
neuralforecast 1.6.4
python 3.9.18
ray 2.9.3
torch 2.2.1
transformers 4.39.0
pandas 2.2.1

Reproduction script

Official tutorial example: https://nixtlaverse.nixtla.io/neuralforecast/examples/getting_started_complete.html

Issue Severity

High: It blocks me from completing my task.


jmoralez · 2024-03-22T01:04:17Z

Hey @hxuaj, sorry for the troubles.

The first error should be fixed by setting the CUDA_VISIBLE_DEVICES env variable to one of your devices (0 or 1), either through the terminal or in your session with os.environ.
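
As a minimal sketch of the os.environ route mentioned above: the variable generally has to be set before torch or neuralforecast are imported, since CUDA devices are typically enumerated once per process. Picking device "0" is an assumption here; "1" would select the other GPU.

```python
import os

# Set this before importing torch / neuralforecast, otherwise the process
# may already have seen both GPUs.
# Assumption: we keep the first GPU ("0"); use "1" to keep the second one.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Only import the training libraries after the variable is set, e.g.:
# from neuralforecast import NeuralForecast

visible = os.environ["CUDA_VISIBLE_DEVICES"]
print(f"CUDA_VISIBLE_DEVICES={visible}")
```

Setting the variable in the terminal before launching Python has the same effect and avoids any import-order concerns.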

The second error, I'm guessing, refers to the fact that the dataframe has an index. In any case, we're deprecating the datasetsforecast losses, so you should do something like this instead:

from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mse, mae, rmse

evaluation_df = evaluate(cv_df, [mse, mae, rmse])

hxuaj · 2024-03-22T07:47:52Z

Hi @jmoralez,
Thanks for the quick reply.
For the first error: my local machine has 2 GPUs, and this seems to be a bug in PyTorch Lightning: Lightning-AI/pytorch-lightning#4612. However, I didn't find a proper solution for it. As you suggested, I can now run the model fit with only one GPU visible as a workaround.
For the second error, I changed the code to:

from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mse, mae, rmse

cv_df.reset_index(inplace=True)
evaluation_df = evaluate(cv_df, [mse, mae, rmse])

I just reset the index of cv_df before evaluation, so that the index becomes a regular column. Now it works fine.
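
To illustrate why the reset_index call helps, here is a small sketch with a made-up stand-in for cv_df. The column names and values are hypothetical, and it assumes the real cv_df carries unique_id as a named index, which is what the "index can't have name" message suggests.

```python
import pandas as pd

# Hypothetical stand-in for cv_df: 'unique_id' lives in a *named* index,
# which is the situation the SchemaError complains about.
cv_df = pd.DataFrame(
    {
        "ds": [1, 2, 1, 2],
        "y": [1.0, 2.0, 3.0, 4.0],
        "NHITS": [1.1, 1.9, 3.2, 3.8],
    },
    index=pd.Index(["H1", "H1", "H2", "H2"], name="unique_id"),
)

# reset_index moves the named index into an ordinary column and
# replaces it with an unnamed RangeIndex.
flat_df = cv_df.reset_index()

print(list(flat_df.columns))
print(flat_df.index.name)
```

The non-inplace form returns a new frame and leaves cv_df untouched, which makes it easier to re-run a notebook cell without resetting the index twice.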

Could you update the relevant parts of the official tutorial? It can be frustrating to encounter such errors in the examples. Thank you again.

grant-d · 2024-04-14T21:56:43Z

Just add index to df before evaluation. Now it works fine.

Just ran into the same issue; your workaround fixed it, thanks @hxuaj.
@jmoralez, BTW, the error has a typo: datafame vs. dataframe
(and may as well fix the grammar too: pandas-like dataframe index can't have name)

jmoralez · 2024-04-15T17:06:05Z

BTW, the error has a typo

That's not coming from our libs, feel free to open an issue in the corresponding lib.

hxuaj added the bug label Mar 21, 2024

jmoralez added the awaiting response label Mar 22, 2024

github-actions bot removed the awaiting response label Mar 22, 2024

hxuaj mentioned this issue Apr 1, 2024

TimeLLM takes a long time to setup training. #950

Closed

elephaint closed this as completed Apr 29, 2024

jmoralez reopened this Apr 29, 2024
