We use the TFT to predict groundwater levels for more than 5,000 wells with a horizon of 12 weeks.
Expected behavior
When the training and validation losses decrease during training, I would expect the validation metrics, specifically MAE (VAL_MAE) and RMSE (VAL_RMSE), to decrease as well.
Actual behavior
During training with the TFT, I noticed that VAL_MAE and VAL_RMSE remained almost constant and that the values were extremely high for the problem we are studying. The loss curves for training and validation looked normal. `val_check_interval` was set to 0.1. See the graphs below.
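For reference, the validation frequency is configured via PyTorch Lightning's `Trainer`; a minimal sketch of the setup (only `val_check_interval=0.1` is taken from the description above, the other arguments are placeholders):

```python
# Minimal sketch of the trainer setup; only val_check_interval=0.1 is taken
# from the description above, everything else is a placeholder.
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=30,            # placeholder
    gradient_clip_val=0.1,    # placeholder
    val_check_interval=0.1,   # run validation every 10% of a training epoch
)
```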
What I then did was take model checkpoints at every 10% of the training progress, make predictions with each checkpoint on the validation set, and compare the resulting MAE and RMSE values with the VAL_MAE and VAL_RMSE values logged during training (roughly as in the sketch below). The MAE and RMSE computed from these predictions are on average (i.e. averaged over the 5,000 wells) about 0.4–0.5 for each horizon, which makes much more sense.
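A minimal sketch of that comparison; the checkpoint file names and the `val_dataloader` variable are placeholders, and the actuals/predict pattern follows the pytorch-forecasting tutorials:

```python
# Sketch of the checkpoint comparison; checkpoint file names and
# `val_dataloader` are placeholders.
import torch
from pytorch_forecasting import TemporalFusionTransformer

# collect the ground-truth targets from the validation dataloader
actuals = torch.cat([y[0] for x, y in iter(val_dataloader)])

for ckpt in ["tft_010.ckpt", "tft_020.ckpt"]:  # one checkpoint per 10% of training
    model = TemporalFusionTransformer.load_from_checkpoint(ckpt)
    predictions = model.predict(val_dataloader)  # point predictions
    mae = (actuals - predictions).abs().mean()
    rmse = ((actuals - predictions) ** 2).mean().sqrt()
    print(f"{ckpt}: MAE={mae.item():.3f}, RMSE={rmse.item():.3f}")
```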
Has anyone else observed this behaviour when working with the TFT? Is the comparison I made reasonable, and does it mean that there is a potential problem with the implementation of the TFT's validation metrics? (By the way, the fully trained TFT model doesn't perform badly at all.)
Unfortunately, I cannot share my code at the moment, as it is part of a research project.
Graphs: VAL_MAE, VAL_RMSE, train loss, validation loss.