
GPU Optimization with Num_Workers not working #2354

Open

Laenita opened this issue Apr 26, 2024 · 8 comments

Laenita commented Apr 26, 2024

I am not very experienced, but I love this package. However, training only seems to utilize about 1% of my GPU. Increasing the batch size made my predictions far less accurate, and I read that increasing num_loader_workers should help, but then I get a log message telling me to set persistent_workers=True on the val_dataloader, which as far as I know Darts does not expose, and the model ends up running about 5 times longer. Can you please assist? I just got a better GPU to speed up training, but I can't get it to use more of the GPU. Here is my model for reference:

    NHiTS_Model = NHiTSModel(
        model_name="Nhits_run",
        input_chunk_length=input_length_chunk,
        output_chunk_length=forecasting_horizon,
        num_stacks=number_stacks,
        num_blocks=number_blocks,
        num_layers=number_layers,
        layer_widths=lay_widths,
        n_epochs=number_epochs,
        nr_epochs_val_period=number_epochs_val_period,
        batch_size=batch_size,
        dropout=dropout_rate,
        force_reset=True,
        save_checkpoints=True,
        optimizer_cls=torch.optim.AdamW,
        loss_fn=torch.nn.HuberLoss(),
        random_state=rand_state,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": [0],
        },
    )
    NHiTS_Model.fit(
        series=train,
        past_covariates=train_cov,
        verbose=True,
        val_series=val,
        val_past_covariates=val_cov,
        num_loader_workers=1
    )
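
As a quick check before tuning the data loader, one minimal sketch (assuming a CUDA build of PyTorch and the standard nvidia-smi tooling) to confirm the GPU is actually visible is:

    import torch

    # Confirm that PyTorch sees the GPU that Darts is being asked to use.
    print(torch.cuda.is_available())      # should print True
    print(torch.cuda.get_device_name(0))  # should print the new GPU's name

    # While the model trains, watch utilisation from a separate terminal:
    #   nvidia-smi --loop=1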

Laenita commented Apr 26, 2024

Oh, and the newer GPU and the much weaker one train for the same length of time, so there is a bottleneck somewhere.

madtoinou (Collaborator) commented

Hi @Laenita,

Would you mind sharing the values of these parameters, so that we can get an idea of the number of parameters / size of the model?

Is the GPU acceleration being used at 1% for both the old and the new devices?

The pl_trainer_kwargs argument looks good; this is what PyTorch Lightning expects in order to enable GPU acceleration. I would recommend looking up their documentation, as this is what Darts relies on for the deep learning models.
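
For instance, since pl_trainer_kwargs is forwarded to the Lightning Trainer, other Trainer options can go through the same dict; a minimal sketch (the exact precision value depends on the installed Lightning version) is:

    pl_trainer_kwargs = {
        "accelerator": "gpu",
        "devices": [0],
        # Mixed precision often raises GPU throughput; newer Lightning
        # versions expect "16-mixed", older ones accept 16.
        "precision": "16-mixed",
    }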


Laenita commented May 1, 2024

Hi @madtoinou

Of course, here are the parameters for my model; I hope this helps:
input_length_chunk = 20
forecasting_horizon = 3
number_stacks = 4
number_blocks = 5
number_layers = 5
batch_size = 64
dropout_rate = 0.1
number_epochs = 180
number_epochs_val_period = 1

And yes, both the old and the newer (and much faster) GPUs show only 1% utilisation and take the same time to train the same model, which indicates that something is wrong and the GPU is heavily under-utilised.

But also, num_loader_workers=1 is not working at all for me; training takes more than an hour with num_loader_workers > 0.

Thanks for your assistance!

igorrivin commented

Yes, I have the same problem: I am told that num_loader_workers is not a legit parameter.

madtoinou (Collaborator) commented

Hi @igorrivin & @Laenita,

As mentioned in another thread, PR #2295 adds support for those arguments. Maybe try installing that branch / copying the changes and see if it solves the bottleneck?
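
Based on what is described in this thread (the exact argument names should be taken from the PR diff, so treat this as a hypothetical sketch rather than a released API), the idea is to forward DataLoader options such as persistent_workers through fit():

    # Hypothetical usage after applying the changes from that PR; the
    # parameter names below come from this thread, not from a released
    # Darts version.
    NHiTS_Model.fit(
        series=train,
        past_covariates=train_cov,
        val_series=val,
        val_past_covariates=val_cov,
        verbose=True,
        num_loader_workers=4,     # spawn several DataLoader worker processes
        persistent_workers=True,  # keep workers alive between epochs
    )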


Laenita commented May 9, 2024

Hi @madtoinou

I have copied the changes from PR #2295, but now whenever I add persistent_workers=True and num_loader_workers=16 (or even just 1), it gets stuck on the sanity checking step. Did I maybe miss anything? Thank you for your assistance!

madtoinou (Collaborator) commented

Which sanity checking are you referring to?


Laenita commented May 22, 2024

Hi @madtoinou, the best explanation I can give is this PNG, where the model first goes into a "Sanity Checking" phase before starting training:
[Screenshot: PyTorch Lightning progress bar showing "Sanity Checking" before training starts]
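
For context, that phase is PyTorch Lightning running a few validation batches before training begins; if that is where the run hangs, it can be skipped through the trainer kwargs. A minimal sketch (num_sanity_val_steps is a Lightning Trainer argument, not a Darts-specific one):

    pl_trainer_kwargs = {
        "accelerator": "gpu",
        "devices": [0],
        # Skip Lightning's pre-training validation pass ("Sanity Checking").
        "num_sanity_val_steps": 0,
    }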
