
Support multi-GPU training #676

Open
srogatch opened this issue May 25, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@srogatch

So far I couldn't find a way to train on multiple GPUs within the same computer. If such a way exists, please describe how to do it.

@isaacmg
Collaborator

isaacmg commented Jun 8, 2023

Hello, sorry for the delay. We do currently have Docker containers which you can use with Wandb to perform a distributed hyper-parameter sweep. IMO multi-GPU training for a single model isn't much benefit: it is very hard to saturate even a single GPU unless you have huge batch sizes. The bottleneck generally comes from other things.
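
For reference, a distributed sweep along those lines typically follows the pattern below. This is only a rough sketch, not the actual FF Docker/Wandb setup; the project name, parameter names, and training function are placeholders. Each container would run its own agent against the same sweep ID.

```python
import wandb

# Hypothetical sweep configuration -- parameter names depend on the actual model config.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "validation_loss", "goal": "minimize"},
    "parameters": {
        "batch_size": {"values": [32, 64, 128]},
        "learning_rate": {"min": 1e-5, "max": 1e-2},
    },
}

def train():
    # Each agent invocation is one run; run.config holds the sampled hyper-parameters.
    run = wandb.init()
    batch_size = run.config.batch_size
    # ... build the model from run.config and train it here ...
    run.log({"validation_loss": 0.0})  # placeholder metric
    run.finish()

# Create the sweep once, then start an agent in each container/worker.
sweep_id = wandb.sweep(sweep_config, project="my-forecasting-project")
wandb.agent(sweep_id, function=train)
```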

@srogatch
Author

srogatch commented Jun 8, 2023

I have a batch size of 64, a history length of 1440, a lookahead of 480, and 2 million points in the time series, each consisting of 4 values. A single GPU is currently at 97-100% utilization, and judging from the power consumption it's indeed fully saturated, so I could benefit from multiple GPUs.

@isaacmg
Collaborator

isaacmg commented Jun 8, 2023

Interesting, I've never really run into that problem before. Let me look into it. FF is built on top of PyTorch, of course, so hopefully it is something I could add reasonably quickly. Out of the box, though, we don't currently support it, as we mainly use model.to().

@isaacmg isaacmg added the enhancement New feature or request label Jun 8, 2023
@srogatch
Author

srogatch commented Jun 9, 2023

Yes, we need to wrap the model in a DistributedDataParallel object, launch one process per GPU, get the local rank of each process, and use it as the device parameter in model.to(). I had planned to add this myself, but unfortunately I had to postpone this project because higher priorities came up.
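
A minimal sketch of that pattern, launched with torchrun so each process gets its own local rank. This is illustrative only, not FF code; the toy model, dataset, and script name are placeholders standing in for the existing training loop.

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-ins for the real model and dataset; the wrapping pattern is what matters.
    model = nn.Linear(4, 1).to(local_rank)          # local rank used as the device in .to()
    model = DDP(model, device_ids=[local_rank])
    data = TensorDataset(torch.randn(2048, 4), torch.randn(2048, 1))
    sampler = DistributedSampler(data)              # shards batches across processes
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.Adam(model.parameters())
    loss_fn = nn.MSELoss()
    for epoch in range(3):
        sampler.set_epoch(epoch)                    # different shuffle each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()         # DDP all-reduces gradients here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Launch with one process per GPU, e.g.:
    #   torchrun --nproc_per_node=4 ddp_sketch.py
    main()
```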
