[BUG] deep learning model unable to train in Colab, torchaudio/lib/libtorchaudio.so: undefined symbol #4135

Open
300LiterPropofol opened this issue Apr 24, 2024 · 3 comments
Labels: bug: unconfirmed, module: timeseries, Needs Triage

300LiterPropofol commented Apr 24, 2024

Describe the bug
I am using Google Colab with the Python 3 Google Compute Engine backend. I freshly installed autogluon==1.1.0, ran the sample training code, and wanted to train a deep learning model.

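import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

# Load the M4 hourly sample data (long format: item_id, timestamp, target)
df = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/timeseries/m4_hourly_subset/train.csv")

train_data = TimeSeriesDataFrame.from_data_frame(
    df,
    id_column="item_id",
    timestamp_column="timestamp",
)
predictor = TimeSeriesPredictor(
    prediction_length=48,
    path="autogluon-m4-hourly",
    target="target",
    eval_metric="MASE",
)

# Train only DeepAR, with default hyperparameters
predictor.fit(
    train_data,
    hyperparameters={
        "DeepAR": {},
    },
)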

I cannot run it successfully and get this error:

Beginning AutoGluon training...
AutoGluon will save models to 'autogluon-m4-hourly'
=================== System Info ===================
AutoGluon Version:  1.1.0
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023
CPU Count:          2
GPU Count:          0
Memory Avail:       10.27 GB / 12.67 GB (81.0%)
Disk Space Avail:   74.95 GB / 107.72 GB (69.6%)
===================================================

Fitting with arguments:
{'enable_ensemble': True,
 'eval_metric': MASE,
 'hyperparameters': {'DeepAR': {}},
 'known_covariates_names': [],
 'num_val_windows': 1,
 'prediction_length': 48,
 'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
 'random_seed': 123,
 'refit_every_n_windows': 1,
 'refit_full': False,
 'skip_model_selection': False,
 'target': 'target',
 'verbosity': 2}

Inferred time series frequency: 'H'
Provided train_data has 148060 rows, 200 time series. Median time series length is 700 (min=700, max=960). 

Provided data contains following columns:
	target: 'target'

AutoGluon will gauge predictive performance using evaluation metric: 'MASE'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
===================================================

Starting training. Start time is 2024-04-24 14:58:20
Models that will be trained: ['DeepAR']
Training timeseries model DeepAR. 
	Warning: Exception caused DeepAR to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN2at4_ops13scalar_tensor4callERKN3c106ScalarESt8optionalINS2_10ScalarTypeEES6_INS2_6LayoutEES6_INS2_6DeviceEES6_IbE
Not fitting ensemble as no models were successfully trained.
Training complete. Models trained: []
Total runtime: 0.07 s
Trainer has no fit models that can predict.
<autogluon.timeseries.predictor.TimeSeriesPredictor at 0x78931e9bb400>

The same thing happens with all deep learning models; it seems none of them can be trained:

Warning: Exception caused TemporalFusionTransformer to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN2at4_ops13scalar_tensor4callERKN3c106ScalarESt8optionalINS2_10ScalarTypeEES6_INS2_6LayoutEES6_INS2_6DeviceEES6_IbE
Training timeseries model DeepAR. 
	Warning: Exception caused DeepAR to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN2at4_ops13scalar_tensor4callERKN3c106ScalarESt8optionalINS2_10ScalarTypeEES6_INS2_6LayoutEES6_INS2_6DeviceEES6_IbE
Training timeseries model PatchTST. 
	Warning: Exception caused PatchTST to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN2at4_ops13scalar_tensor4callERKN3c106ScalarESt8optionalINS2_10ScalarTypeEES6_INS2_6LayoutEES6_INS2_6DeviceEES6_IbE
Fitting simple weighted ensemble.
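
An undefined symbol in libtorchaudio.so usually points to a torchaudio binary that was built against a different torch version than the one currently installed. As a minimal diagnostic sketch (the helper name `installed_versions` and the default package list are illustrative assumptions, not part of AutoGluon), the installed versions can be read from package metadata without importing torchaudio, which would otherwise trigger the same error:

```python
from importlib import metadata

# Hypothetical default list for illustration: the packages discussed in this issue.
DEFAULT_PACKAGES = ("torch", "torchaudio", "torchvision", "torchtext", "autogluon.timeseries")


def installed_versions(packages=DEFAULT_PACKAGES):
    """Return {package: version string, or None if not installed}.

    Reads distribution metadata only, so no compiled library is loaded.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

If torch and torchaudio report different release lines, that mismatch is a likely cause of the error above.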
300LiterPropofol added the bug: unconfirmed and Needs Triage labels on Apr 24, 2024
300LiterPropofol changed the title from "[BUG]" to "[BUG] deep learning model unable to train on fresh environment, torchaudio/lib/libtorchaudio.so: undefined symbol" on Apr 24, 2024
@300LiterPropofol
Author

Update:
If I install torchaudio separately before AutoGluon:

!pip install -U torch torchaudio --no-cache-dir
!pip install autogluon==1.1.0

I still get an error, but with a different undefined symbol:

Training timeseries model TemporalFusionTransformer. 
	Warning: Exception caused TemporalFusionTransformer to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN3c104impl3cow11cow_deleterEPv
Training timeseries model DeepAR. 
	Warning: Exception caused DeepAR to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN3c104impl3cow11cow_deleterEPv
Training timeseries model PatchTST. 
	Warning: Exception caused PatchTST to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN3c104impl3cow11cow_deleterEPv
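
A different missing symbol after changing the install order is consistent with torch and torchaudio still coming from different releases, for example if the later AutoGluon install changed the torch version while leaving the upgraded torchaudio in place. That explanation is an assumption, not something confirmed in this thread. To my understanding, torchaudio version numbers have followed torch's major.minor since the 2.0 releases, so a rough compatibility heuristic can be sketched as below. The function names and the version strings in the usage note are hypothetical.

```python
def major_minor(version):
    """Reduce a version such as '2.2.1+cu121' to the tuple (2, 2).

    Drops the local build tag after '+' and everything past the minor number.
    """
    public = version.split("+", 1)[0]
    major, minor = public.split(".")[:2]
    return int(major), int(minor)


def same_release_line(torch_version, torchaudio_version):
    """Heuristic: True when both versions share the same major.minor."""
    return major_minor(torch_version) == major_minor(torchaudio_version)
```

For instance, `same_release_line("2.1.2", "2.2.1+cu121")` compares (2, 1) against (2, 2) and reports a mismatch. This is only a heuristic; a matching release line does not guarantee that the two binaries are compatible.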

@shchur
Collaborator

shchur commented May 2, 2024

Hi @300LiterPropofol, thank you for reporting the issue! A quick and easy way to fix it is to uninstall torchaudio/torchvision/torchtext after installing AutoGluon with the following command:

!pip uninstall torchaudio torchvision torchtext

This command should already be there in the first cell when you click "Open in Colab" in the time series tutorials, but I will double check that it works as expected.
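
For readers who run AutoGluon from a script rather than a notebook cell, the same uninstall step can be sketched in Python. This is an illustrative equivalent, not the command from the tutorials: the function names are hypothetical, and the `-y` flag (which skips pip's confirmation prompt) is an addition to the command above.

```python
import subprocess
import sys

# The packages named in the workaround above.
PACKAGES = ["torchaudio", "torchvision", "torchtext"]


def uninstall_command(packages=PACKAGES):
    """Build the pip command as an argument list.

    sys.executable makes sure the pip of the running interpreter is used;
    "-y" answers the confirmation prompt automatically.
    """
    return [sys.executable, "-m", "pip", "uninstall", "-y", *packages]


def uninstall(packages=PACKAGES):
    """Run the uninstall and return pip's exit code."""
    return subprocess.run(uninstall_command(packages), check=False).returncode
```

Calling `uninstall()` once after installing AutoGluon should have the same effect as the notebook command. If torch was already imported in the session, restarting the runtime afterwards may be necessary.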

shchur added the module: timeseries label on May 2, 2024
shchur changed the title from "[BUG] deep learning model unable to train on fresh environment, torchaudio/lib/libtorchaudio.so: undefined symbol" to "[BUG] deep learning model unable to train in Colab, torchaudio/lib/libtorchaudio.so: undefined symbol" on May 2, 2024
shchur self-assigned this on May 2, 2024
@300LiterPropofol
Author

Yes! Thank you for saving me on this! Uninstalling the packages does indeed solve the error!
