[BUG] deep learning model unable to train in Colab, torchaudio/lib/libtorchaudio.so: undefined symbol #4135

300LiterPropofol · 2024-04-24T15:01:26Z

Describe the bug
I am using Google Colab with the Python 3 Google Compute Engine backend. I freshly installed autogluon==1.1.0, loaded the sample training data, and tried to train a deep learning model:

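import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

# Load the M4 hourly sample data used in the time series tutorials
df = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/timeseries/m4_hourly_subset/train.csv")

train_data = TimeSeriesDataFrame.from_data_frame(
    df,
    id_column="item_id",
    timestamp_column="timestamp",
)

predictor = TimeSeriesPredictor(
    prediction_length=48,
    path="autogluon-m4-hourly",
    target="target",
    eval_metric="MASE",
)

# Train only the DeepAR deep learning model, with default hyperparameters
predictor.fit(
    train_data,
    hyperparameters={
        "DeepAR": {},
    },
)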

The model fails to train, and I get this error:

Beginning AutoGluon training...
AutoGluon will save models to 'autogluon-m4-hourly'
=================== System Info ===================
AutoGluon Version:  1.1.0
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023
CPU Count:          2
GPU Count:          0
Memory Avail:       10.27 GB / 12.67 GB (81.0%)
Disk Space Avail:   74.95 GB / 107.72 GB (69.6%)
===================================================

Fitting with arguments:
{'enable_ensemble': True,
 'eval_metric': MASE,
 'hyperparameters': {'DeepAR': {}},
 'known_covariates_names': [],
 'num_val_windows': 1,
 'prediction_length': 48,
 'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
 'random_seed': 123,
 'refit_every_n_windows': 1,
 'refit_full': False,
 'skip_model_selection': False,
 'target': 'target',
 'verbosity': 2}

Inferred time series frequency: 'H'
Provided train_data has 148060 rows, 200 time series. Median time series length is 700 (min=700, max=960). 

Provided data contains following columns:
	target: 'target'

AutoGluon will gauge predictive performance using evaluation metric: 'MASE'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
===================================================

Starting training. Start time is 2024-04-24 14:58:20
Models that will be trained: ['DeepAR']
Training timeseries model DeepAR. 
	Warning: Exception caused DeepAR to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN2at4_ops13scalar_tensor4callERKN3c106ScalarESt8optionalINS2_10ScalarTypeEES6_INS2_6LayoutEES6_INS2_6DeviceEES6_IbE
Not fitting ensemble as no models were successfully trained.
Training complete. Models trained: []
Total runtime: 0.07 s
Trainer has no fit models that can predict.
<autogluon.timeseries.predictor.TimeSeriesPredictor at 0x78931e9bb400>

The same thing happens with all deep learning models; none of them can be trained:

Warning: Exception caused TemporalFusionTransformer to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN2at4_ops13scalar_tensor4callERKN3c106ScalarESt8optionalINS2_10ScalarTypeEES6_INS2_6LayoutEES6_INS2_6DeviceEES6_IbE
Training timeseries model DeepAR. 
	Warning: Exception caused DeepAR to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN2at4_ops13scalar_tensor4callERKN3c106ScalarESt8optionalINS2_10ScalarTypeEES6_INS2_6LayoutEES6_INS2_6DeviceEES6_IbE
Training timeseries model PatchTST. 
	Warning: Exception caused PatchTST to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN2at4_ops13scalar_tensor4callERKN3c106ScalarESt8optionalINS2_10ScalarTypeEES6_INS2_6LayoutEES6_INS2_6DeviceEES6_IbE
Fitting simple weighted ensemble.


300LiterPropofol · 2024-04-26T14:22:54Z

Update:
If I install torch and torchaudio separately first:

!pip install -U torch torchaudio --no-cache-dir
!pip install autogluon==1.1.0

I still get an error, but with a different undefined symbol:

Training timeseries model TemporalFusionTransformer. 
	Warning: Exception caused TemporalFusionTransformer to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN3c104impl3cow11cow_deleterEPv
Training timeseries model DeepAR. 
	Warning: Exception caused DeepAR to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN3c104impl3cow11cow_deleterEPv
Training timeseries model PatchTST. 
	Warning: Exception caused PatchTST to fail during training... Skipping this model.
	/usr/local/lib/python3.10/dist-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZN3c104impl3cow11cow_deleterEPv
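For readers trying to interpret the two missing symbols, their leading names can be decoded by hand. The following is a minimal sketch, assuming the symbols use the simple nested-name form of Itanium C++ ABI mangling (`_ZN`, then length-prefixed components, then `E`). `demangle_nested_name` is a hypothetical helper written for this illustration, not part of any library, and it ignores the argument types that follow the name.

```python
def demangle_nested_name(symbol: str) -> str:
    """Decode the leading nested name of an Itanium-ABI mangled symbol.

    Only the simple `_ZN<len><name>...E` form is handled (no templates or
    substitutions inside the name). Everything after the closing `E`,
    i.e. the argument types, is ignored.
    """
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        raise ValueError(f"not a nested mangled name: {symbol!r}")
    pos = len(prefix)
    parts = []
    while pos < len(symbol) and symbol[pos] != "E":
        # Each component is a decimal length followed by that many characters.
        start = pos
        while pos < len(symbol) and symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError(f"unsupported mangling at offset {pos}")
        length = int(symbol[start:pos])
        parts.append(symbol[pos:pos + length])
        pos += length
    return "::".join(parts)


# The two symbols reported in the logs above.
first = (
    "_ZN2at4_ops13scalar_tensor4callERKN3c106ScalarESt8optionalINS2_"
    "10ScalarTypeEES6_INS2_6LayoutEES6_INS2_6DeviceEES6_IbE"
)
second = "_ZN3c104impl3cow11cow_deleterEPv"

print(demangle_nested_name(first))   # at::_ops::scalar_tensor::call
print(demangle_nested_name(second))  # c10::impl::cow::cow_deleter
```

Both names sit in the `at::` and `c10::` namespaces, which belong to PyTorch's core libraries. That suggests libtorchaudio.so was compiled against a different torch build than the one installed, which would also explain why the missing symbol changes when torch and torchaudio are upgraded.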

shchur · 2024-05-02T08:25:24Z

Hi @300LiterPropofol, thank you for reporting the issue! A quick and easy way to fix it is to uninstall torchaudio/torchvision/torchtext after installing AutoGluon with the following command:

!pip uninstall torchaudio torchvision torchtext

This command should already be there in the first cell when you click "Open in Colab" in the time series tutorials, but I will double check that it works as expected.
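To confirm that the uninstall took effect, the installed packages can be inspected without importing them (importing torchaudio is what triggers the undefined-symbol error). This is a minimal sketch using only the standard library; `installed_version` and `leftover_packages` are hypothetical helper names chosen for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the suggested uninstall command.
CONFLICTING = ("torchaudio", "torchvision", "torchtext")


def installed_version(name: str):
    """Return the installed version string of `name`, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def leftover_packages(names=CONFLICTING):
    """Map each still-installed package in `names` to its version."""
    found = {}
    for name in names:
        ver = installed_version(name)
        if ver is not None:
            found[name] = ver
    return found


print("torch:", installed_version("torch"))
print("still installed:", leftover_packages() or "none")
```

If the second line prints anything other than `none`, the uninstall did not complete. In a notebook cell, pip's `-y` flag (`!pip uninstall -y torchaudio torchvision torchtext`) skips the confirmation prompt.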

300LiterPropofol · 2024-05-02T11:36:16Z

Yes! Thank you for saving me on this! Uninstalling the packages does indeed solve the error!

300LiterPropofol added the bug: unconfirmed and Needs Triage labels Apr 24, 2024

300LiterPropofol changed the title ~~[BUG]~~ [BUG] deep learning model unable to train on fresh environment, torchaudio/lib/libtorchaudio.so: undefined symbol Apr 24, 2024

shchur added the module: timeseries label May 2, 2024

shchur changed the title ~~[BUG] deep learning model unable to train on fresh environment, torchaudio/lib/libtorchaudio.so: undefined symbol~~ [BUG] deep learning model unable to train in Colab, torchaudio/lib/libtorchaudio.so: undefined symbol May 2, 2024

shchur self-assigned this May 2, 2024
