model training got stuck when running the official tutorial example #937

Open
hxuaj opened this issue Mar 21, 2024 · 4 comments

@hxuaj
hxuaj commented Mar 21, 2024

What happened + What you expected to happen

Hi,
I'm new to Nixtla. When I tried to run the example code from the official tutorial (https://nixtlaverse.nixtla.io/neuralforecast/examples/getting_started_complete.html) on my local machine (Linux, CentOS), it got stuck at the nf.fit(df=Y_df) step:

2024-03-21 17:00:29,350	INFO worker.py:1724 -- Started a local Ray instance.
2024-03-21 17:00:29,926	INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2024-03-21 17:00:29,927	INFO tune.py:592 -- [output] This will use the new output engine with verbosity 0. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment     _train_tune_2024-03-21_17-00-27   │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm                 BasicVariantGenerator             │
│ Scheduler                        FIFOScheduler                     │
│ Number of trials                 5                                 │
╰────────────────────────────────────────────────────────────────────╯

View detailed results here: /root/ray_results/_train_tune_2024-03-21_17-00-27
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/_train_tune_2024-03-21_17-00-27`
(_train_tune pid=2885517) Seed set to 11
(_train_tune pid=2885517) [rank: 0] Seed set to 11
(_train_tune pid=2885517) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2

The steps I took to set up:

  1. Used conda to create a new environment with Python 3.9.
  2. Ran pip install statsforecast s3fs datasetsforecast, as in the tutorial example.
  3. Ran pip install git+https://github.com/Nixtla/neuralforecast.git@main, as in the tutorial example.
  4. Ran pip install matplotlib to get the third step of the tutorial to work.
  5. Changed freq='H' to freq=1 in nf = NeuralForecast( models=[ AutoNHITS(h=48, config=config_nhits, loss=MQLoss(), num_samples=5), AutoLSTM(h=48, config=config_lstm, loss=MQLoss(), num_samples=2), ], freq='H' ), because freq='H' raised ValueError: Time column contains integers but the specified frequency is not an integer. Please provide a valid integer, e.g. 'freq=1'
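
The freq change in step 5 can be sketched as a small check on the time column. This is only an illustration: pick_freq is a hypothetical helper, not part of neuralforecast, and the toy frames stand in for the tutorial's Y_df. It assumes the rule implied by the error message, namely that an integer time column needs an integer freq while a timestamp column takes a pandas offset alias.

```python
import pandas as pd


def pick_freq(df: pd.DataFrame, time_col: str = "ds", timestamp_freq: str = "H"):
    """Hypothetical helper: integer step for integer time columns, offset alias otherwise."""
    if pd.api.types.is_integer_dtype(df[time_col]):
        return 1  # integer time index, so the step between rows is an integer
    return timestamp_freq  # timestamp column, so use a pandas offset alias


# Toy data, not the tutorial dataset
int_df = pd.DataFrame({"unique_id": ["H1"] * 3, "ds": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
ts_df = int_df.assign(
    ds=pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"])
)

freq_for_int = pick_freq(int_df)
freq_for_ts = pick_freq(ts_df)
```

The returned value would then be passed as the freq argument of NeuralForecast.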

I was wondering what could have gone wrong in the steps above, and why it got stuck during training.

Then I tried the tutorial notebook in Colab. There the fit step completes, but the evaluation step evaluation_df = accuracy(cv_df, [mse, mae, rmse], agg_by=['unique_id']) raises an error:

ValueError                                Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/triad/collections/schema.py](https://localhost:8080/#) in append(self, obj)
    359             elif isinstance(obj, pd.DataFrame):
--> 360                 self._append_pa_schema(PD_UTILS.to_schema(obj))
    361             elif isinstance(obj, Tuple):  # type: ignore

11 frames
ValueError: pandas like datafame index can't have name

During handling of the above exception, another exception occurred:

SchemaError                               Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/triad/collections/schema.py](https://localhost:8080/#) in append(self, obj)
    370             raise
    371         except Exception as e:
--> 372             raise SchemaError(str(e))
    373 
    374     def remove(  # noqa: C901

SchemaError: pandas like datafame index can't have name

Looking forward to your reply.

Versions / Dependencies

OS: Linux CentOS
neuralforecast 1.6.4
python 3.9.18
ray 2.9.3
torch 2.2.1
transformers 4.39.0
pandas 2.2.1

Reproduction script

Official tutorial example: https://nixtlaverse.nixtla.io/neuralforecast/examples/getting_started_complete.html

Issue Severity

High: It blocks me from completing my task.

@hxuaj hxuaj added the bug label Mar 21, 2024
@jmoralez
Member

Hey @hxuaj, sorry for the troubles.

The first error should be fixed by setting the CUDA_VISIBLE_DEVICES env variable to one of your devices (0 or 1), either through the terminal or in your session with os.environ.
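
A minimal sketch of the in-session variant, assuming device 0 is the GPU you want to keep and that this runs before torch or Lightning touch CUDA (for example in the first cell of the notebook):

```python
import os

# Assumption: device index "0" is the GPU to keep; use "1" for the other one.
# This has to be set before CUDA is initialised, so put it at the very top
# of the script or notebook, ahead of any torch / neuralforecast import.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

visible = os.environ["CUDA_VISIBLE_DEVICES"]
```

From the terminal, the equivalent is to prefix the command with CUDA_VISIBLE_DEVICES=0 (for example before python train.py, where train.py is a placeholder script name).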

The second error, I'm guessing, comes from the fact that the dataframe has an index. We're deprecating the datasetsforecast losses anyway, so you should do something like this instead:

from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mse, mae, rmse

evaluation_df = evaluate(cv_df, [mse, mae, rmse])

@hxuaj
Author

hxuaj commented Mar 22, 2024

Hi @jmoralez,
Thanks for the quick reply.
For the first error: my local machine has 2 GPUs, and this seems to be a bug in PyTorch Lightning: Lightning-AI/pytorch-lightning#4612. I didn't find a proper solution, but as you suggested, I can now run model fit with only one GPU visible as a workaround.
For the second error, I changed the code to:

from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mse, mae, rmse

cv_df.reset_index(inplace=True)
evaluation_df = evaluate(cv_df, [mse, mae, rmse])

I just reset the index of the DataFrame before evaluation. Now it works fine.
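
A toy sketch of what the reset does, using a hand-built frame rather than the real cv_df. It assumes the cross-validation output carries unique_id as a named index, which is what the "index can't have name" error suggests; the column names and values here are made up for illustration.

```python
import pandas as pd

# Toy stand-in for cv_df (assumption: unique_id is the named index)
cv_df = pd.DataFrame(
    {"ds": [1, 2, 1, 2], "y": [1.0, 2.0, 3.0, 4.0], "NHITS": [1.1, 1.9, 3.2, 3.8]},
    index=pd.Index(["H1", "H1", "H2", "H2"], name="unique_id"),
)

named_index = cv_df.index.name  # the name the schema check complains about

# reset_index moves unique_id into a regular column and leaves
# a default, unnamed integer index behind
flat_df = cv_df.reset_index()
```

After the reset, unique_id is an ordinary column, so functions that look it up by column name can find it.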

Could you update the relevant parts of the official tutorial? It can be frustrating to run into such errors in the examples. Thank you again.

@grant-d
grant-d commented Apr 14, 2024

Just add index to df before evaluation. Now it works fine.

Just ran into the same issue; your workaround fixed it, thanks @hxuaj.
@jmoralez, BTW, the error message has a typo: datafame vs dataf_r_ame
(and may as well fix the grammar too: pandas-like dataframe index can't have name)

@jmoralez
Member

jmoralez commented Apr 15, 2024

BTW, the error has a typo

That's not coming from our libs, feel free to open an issue in the corresponding lib.

@jmoralez jmoralez reopened this Apr 29, 2024