
RuntimeError on variable Validation Batch Sizes in TemporalFusionTransformer Tutorial #1509

Open · Tracked by #1511
nejox opened this issue Feb 7, 2024 · 6 comments


nejox commented Feb 7, 2024

  • PyTorch-Forecasting version: 1.0.0
  • PyTorch version: 2.2.0 (Colab 2.1.0)
  • PyTorch Lightning: 2.1.4
  • Python version: 3.9 (Colab: 3.10)
  • Operating System: macOS 14.2.1 (23C71)

Expected Behavior

I executed the TemporalFusionTransformer tutorial code to forecast demand on the Tutorial Dataset. I expected the model to train without issues and validate across multiple batches.

Actual Behavior

The tutorial's batch size configuration results in only one validation batch, thereby initially masking the error. When the validation DataLoader splits the dataset into multiple batches, with the last batch containing fewer samples than the specified batch size, I encountered a RuntimeError related to tensor size mismatch. Attempting to set drop_last=True did not resolve the issue because this setting is overridden when the mode is set to "PREDICTING" as seen here in the PyTorch Lightning codebase.

It appears to me that in this case, the concatenation dimension may be incorrectly specified here in the PyTorch Forecasting codebase.

Manually forcing drop_last=True to persist (or ensuring all batches have the same size) led to a mismatch between the dimensions of predict()'s output and y attributes, further indicating that the issue likely resides in the dimension specified for concatenation.

Code to reproduce the problem

The issue is reproduced in this Colab notebook.

Snippet from it that sets the batch sizes:

# create dataloaders for model
batch_size = 128  # set this between 32 and 128
# val_batch_size = batch_size * 10 -> this caused no error, because there was then only 1 validation batch and no concatenation was needed during predict()
val_batch_size = 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=val_batch_size, num_workers=0)

leads to

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-25-679346faf40d> in <cell line: 1>()
----> 1 predictions = tft.predict(val_dataloader, return_y=True, trainer_kwargs=dict(accelerator="cpu"))
      2 print("Output shape:", predictions.output.shape)
      3 print("Y shape:", predictions.y[0].shape)
      4 MAE()(predictions.output, predictions.y)

12 frames
/usr/local/lib/python3.10/dist-packages/pytorch_forecasting/utils.py in concat_sequences(sequences)
    247         return rnn.pack_sequence(sequences, enforce_sorted=False)
    248     elif isinstance(sequences[0], torch.Tensor):
--> 249         return torch.cat(sequences, dim=1)
    250     elif isinstance(sequences[0], (tuple, list)):
    251         return tuple(

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 128 but got size 94 for tensor number 2 in the list.
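
To illustrate the failure outside of pytorch-forecasting, here is a minimal standalone sketch (plain torch only; the sizes are chosen to mirror the error above, not taken from the tutorial) showing why concatenating batches of unequal size along dim=1 fails while dim=0 works:

import torch

# Simulated prediction outputs: two full batches plus a smaller final batch,
# each with shape (batch_size, prediction_length).
full_batch = torch.randn(128, 6)
last_batch = torch.randn(94, 6)

# dim=0 stacks the batches along the sample axis and works for unequal batch sizes:
print(torch.cat([full_batch, full_batch, last_batch], dim=0).shape)  # torch.Size([350, 6])

# dim=1 requires every batch to have the same number of rows and raises the
# "Sizes of tensors must match except in dimension 1" error shown above:
try:
    torch.cat([full_batch, full_batch, last_batch], dim=1)
except RuntimeError as e:
    print(e)
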
@Luke-Chesley

I was able to recreate your problem.

I changed return torch.cat(sequences, dim=1) to return torch.cat(sequences, dim=0) in pytorch_forecasting/utils.py line 249, and it no longer raises the error when val_batch_size=128 in this example. After concatenating along dim=0, the resulting tensor has shape (350, 6), the same as when val_batch_size = 1280. This also seems to be what rnn.pack_sequence() effectively does for PackedSequence inputs on line 247.
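
If you'd rather not edit the installed package, a monkey-patch along these lines should have the same effect (just a sketch against pytorch-forecasting 1.0.0; if concat_sequences was imported by name into another module, that reference may need patching as well):

import torch
import pytorch_forecasting.utils as pf_utils

_original_concat_sequences = pf_utils.concat_sequences

def patched_concat_sequences(sequences):
    # Concatenate plain tensors along the batch dimension (dim=0) so a smaller
    # final validation batch no longer causes a size mismatch; everything else
    # falls back to the original implementation.
    if isinstance(sequences[0], torch.Tensor):
        return torch.cat(sequences, dim=0)
    return _original_concat_sequences(sequences)

pf_utils.concat_sequences = patched_concat_sequences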

Let me know if this works for you.


nejox commented Feb 8, 2024

Thanks! This resolves the error. It seems like a major bug, since it should appear in almost every scenario with multiple validation batches, right?
I'm also wondering whether overriding the drop_last parameter in the Lightning module makes sense, but that's a separate question...

@Luke-Chesley

I have not spent much time making predictions with the val_dataloader and just kept the defaults from the tutorial, which may be why this has not been encountered before. I haven't hit this issue when using tft.predict on new/future prediction data (using the prediction-data format from the tutorial), but I have only done that one batch at a time. I will have to look into it more.

@fazaki

fazaki commented Apr 3, 2024

Hi @nejox
I ran into the exact same error, thanks very much for sharing.
I wonder how you managed to fix it, as I don't see the fix merged into the master branch and no newer version has been released.


nejox commented Apr 3, 2024

Hi @fazaki, I didn't really fix the error in my case. For some tests I applied the patch from PR #1511 manually, but in the end I switched to Darts.

@fazaki

fazaki commented Apr 3, 2024

Oh, I see, Darts was my backup plan indeed.
I installed the forked repo by Luke and it worked:

pip install git+https://github.com/Luke-Chesley/pytorch-forecasting.git@master
Thanks @nejox
