
GPU Optimization with Num_Workers not working #2354

Open

Laenita opened this issue Apr 26, 2024 · 8 comments

Laenita commented Apr 26, 2024

I am not very experienced, but I love this package. However, training only seems to utilize about 1% of my GPU. Increasing the batch size made my predictions far less accurate, and I read that increasing num_loader_workers should help, but then I get a log message telling me to set persistent_workers=True on the val_dataloader, which as far as I know Darts does not expose, and the model ends up running about 5 times longer. Can you please assist? I just got a better GPU to speed up training, but I can't get it to use more of the GPU. Here is my model for reference:

    NHiTS_Model = NHiTSModel(
        model_name="Nhits_run",
        input_chunk_length=input_length_chunk,
        output_chunk_length=forecasting_horizon,
        num_stacks=number_stacks,
        num_blocks=number_blocks,
        num_layers=number_layers,
        layer_widths=lay_widths,
        n_epochs=number_epochs,
        nr_epochs_val_period=number_epochs_val_period,
        batch_size=batch_size,
        dropout=dropout_rate,
        force_reset=True,
        save_checkpoints=True,
        optimizer_cls=torch.optim.AdamW,
        loss_fn=torch.nn.HuberLoss(),
        random_state=rand_state,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": [0],
        },
    )
    NHiTS_Model.fit(
        series=train,
        past_covariates=train_cov,
        verbose=True,
        val_series=val,
        val_past_covariates=val_cov,
        num_loader_workers=1
    )
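
As a quick check before tuning the data loader, one minimal sketch (assuming a CUDA build of PyTorch and the standard nvidia-smi tooling) to confirm the GPU is actually visible is:

    import torch

    # Confirm that PyTorch sees the GPU that Darts is being asked to use.
    print(torch.cuda.is_available())      # should print True
    print(torch.cuda.get_device_name(0))  # should print the new GPU's name

    # While the model trains, watch utilisation from a separate terminal:
    #   nvidia-smi --loop=1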

Laenita commented Apr 26, 2024

Oh, and the newer GPU and the much weaker one train for the same length of time, so there is a bottleneck somewhere.

madtoinou (Collaborator) commented

Hi @Laenita,

Would you mind sharing the values of these parameters, so that we can get an idea of the number of parameters / size of the model?

Is the GPU acceleration being used at 1% for both the old and the new devices?

The pl_trainer_kwargs argument looks good; this is what PyTorch Lightning expects in order to enable GPU acceleration. I would recommend looking up their documentation, as this is what Darts relies on for the deep learning models.
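
For instance, since pl_trainer_kwargs is forwarded to the Lightning Trainer, other Trainer options can go through the same dict; a minimal sketch (the exact precision value depends on the installed Lightning version) is:

    pl_trainer_kwargs = {
        "accelerator": "gpu",
        "devices": [0],
        # Mixed precision often raises GPU throughput; newer Lightning
        # versions expect "16-mixed", older ones accept 16.
        "precision": "16-mixed",
    }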


Laenita commented May 1, 2024

Hi @madtoinou

Of course, here are the parameters for my model; I hope this helps:
input_length_chunk = 20
forecasting_horizon = 3
number_stacks = 4
number_blocks = 5
number_layers = 5
batch_size = 64
dropout_rate = 0.1
number_epochs = 180
number_epochs_val_period = 1

And yes, both the old and the newer (and much faster) GPUs show only 1% utilisation and take the same time to train the same model, which indicates that something is wrong and the GPU is heavily under-utilised.

But also, num_loader_workers=1 is not working at all for me; training takes more than an hour with num_loader_workers > 0.

Thanks for your assistance!

igorrivin commented

Yes, I have the same problem: I am told that num_loader_workers is not a legit parameter.

madtoinou (Collaborator) commented

Hi @igorrivin & @Laenita,

As mentioned in another thread, PR #2295 adds support for those arguments. Maybe try installing that branch / copying the changes and see if it solves the bottleneck?
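
Based on what is described in this thread (the exact argument names should be taken from the PR diff, so treat this as a hypothetical sketch rather than a released API), the idea is to forward DataLoader options such as persistent_workers through fit():

    # Hypothetical usage after applying the changes from that PR; the
    # parameter names below come from this thread, not from a released
    # Darts version.
    NHiTS_Model.fit(
        series=train,
        past_covariates=train_cov,
        val_series=val,
        val_past_covariates=val_cov,
        verbose=True,
        num_loader_workers=4,     # spawn several DataLoader worker processes
        persistent_workers=True,  # keep workers alive between epochs
    )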


Laenita commented May 9, 2024

Hi @madtoinou

I have copied the changes from PR #2295, but now whenever I add persistent_workers=True and num_loader_workers=16 (or even just 1), it gets stuck on the sanity checking step. Did I maybe miss anything? Thank you for your assistance!

madtoinou (Collaborator) commented

Which sanity checking are you referring to?


Laenita commented May 22, 2024

Hi @madtoinou, the best explanation I can give is this PNG, where the model first goes into a "Sanity Checking" phase before starting training:
[Screenshot: PyTorch Lightning progress bar showing "Sanity Checking" before training starts]
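
For context, that phase is PyTorch Lightning running a few validation batches before training begins; if that is where the run hangs, it can be skipped through the trainer kwargs. A minimal sketch (num_sanity_val_steps is a Lightning Trainer argument, not a Darts-specific one):

    pl_trainer_kwargs = {
        "accelerator": "gpu",
        "devices": [0],
        # Skip Lightning's pre-training validation pass ("Sanity Checking").
        "num_sanity_val_steps": 0,
    }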
