
DistributedDataParallel and early stopping do not work together #737

Open
thomasjpfan opened this issue Jan 21, 2021 · 3 comments

@thomasjpfan
Member

I had a recent conversation with a user who tried to use DistributedDataParallel with skorch's early stopping, and this caused the process to hang. My guess is that since DDP workers run in their own processes, skorch's early stopping mechanism would stop a worker, but the parent node would not get this information. This leaves the parent waiting for a child that has stopped running.

There may also be an issue with checking the validation loss with DistributedDataParallel, because each worker would have its own loss, and this would need to be gathered to actually compute the loss for a given epoch.
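To illustrate, here is a minimal sketch in plain PyTorch (not skorch's API; names such as `local_valid_loss` and `should_stop` are made up) of the two pieces that would presumably be needed: averaging the per-worker validation loss and making sure every rank reaches the same stop/continue decision.

```python
import torch
import torch.distributed as dist

def aggregated_valid_loss(local_valid_loss, device):
    # Average the validation loss over all DDP workers so that every rank
    # sees the same number for early-stopping purposes.
    loss = torch.tensor([local_valid_loss], device=device)
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    return (loss / dist.get_world_size()).item()

def synchronized_stop(should_stop, device):
    # If any rank decides to stop, every rank stops; otherwise a stopped
    # worker leaves the other workers (and the parent) waiting forever.
    flag = torch.tensor([1 if should_stop else 0], device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())
```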

@BenjaminBossan
Collaborator

Thanks for reporting. Do you know how this is solved more generally (say, only using PyTorch without any frameworks)? I could imagine that similar errors can occur easily, given how tricky multi-threading is in general. Unfortunately, I don't have access to a setup to experiment with this.

@thomasjpfan
Member Author

I do not have a setup to experiment with this either. I've seen two solutions.

  1. During validation, move everything to one GPU and compute the loss/metrics there: https://github.com/Lance0218/Pytorch-DistributedDataParallel-Training-Tricks/blob/fa709835c7bf5e62f48c72b90eb12f3b795ef07d/DDP_warmup.py#L140-L151
  2. During validation, distribute the data to all GPUs and use a barrier to wait until validation is complete (a rough sketch of this pattern follows below): https://github.com/allenai/allennlp/blob/39c40fe38cd2fd36b3465b0b3c031f54ec824160/allennlp/training/trainer.py#L1022-L1025

BTW there are a bunch of barrier calls in this file to handle the distributed case.
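For reference, a rough sketch of option 2 could look like this (`model`, `valid_loader` and `criterion` are placeholders; an initialized process group is assumed): each rank validates only its own data shard and then waits at a barrier, similar to the barrier calls in the allennlp trainer.

```python
import torch
import torch.distributed as dist

def validate_rank_shard(model, valid_loader, criterion, device):
    # Each DDP rank evaluates only its own shard of the validation data.
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for X, y in valid_loader:
            X, y = X.to(device), y.to(device)
            total_loss += criterion(model(X), y).item()
            n_batches += 1
    # Wait until every rank has finished its validation pass before
    # training continues, as in the barrier calls linked above.
    dist.barrier()
    return total_loss / max(n_batches, 1)
```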

Since we do not have the resources to test DDP, I think it would be hard to officially support it.

@BenjaminBossan
Collaborator

I know too little to really comment on that. Ideally, I would wish for skorch to get out of the way enough that users can use DistributedDataParallel if they wish to. Regarding barriers, is that something that could be achieved through callbacks?
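Something along these lines, perhaps (untested; the class name and the choice of the `on_epoch_end` hook are assumptions):

```python
import torch.distributed as dist
from skorch.callbacks import Callback

class DistributedBarrier(Callback):
    # Hypothetical callback: synchronize all DDP ranks at the end of each epoch.
    def on_epoch_end(self, net, **kwargs):
        # Only meaningful when a process group has been initialized.
        if dist.is_available() and dist.is_initialized():
            dist.barrier()
```

It could then be passed to the net via `callbacks=[DistributedBarrier()]`, but again, this is untested.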
