
DistributedDataParallel and early stopping do not work together #737

Open
thomasjpfan opened this issue Jan 21, 2021 · 3 comments

@thomasjpfan
Member

I had a recent conversation with a user who tried to use DistributedDataParallel with skorch's early stopping, and this caused the process to hang. My guess is that since DDP workers run in their own processes, skorch's early stopping mechanism would stop a worker, but the parent node would not get this information. This leaves the parent waiting for a child that has stopped running.

There may also be an issue with checking the validation loss with DistributedDataParallel, because each worker would have its own loss, and this would need to be gathered to actually compute the loss for a given epoch.
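To illustrate, here is a minimal sketch in plain PyTorch (not skorch's API; names such as `local_valid_loss` and `should_stop` are made up) of the two pieces that would presumably be needed: averaging the per-worker validation loss and making sure every rank reaches the same stop/continue decision.

```python
import torch
import torch.distributed as dist

def aggregated_valid_loss(local_valid_loss, device):
    # Average the validation loss over all DDP workers so that every rank
    # sees the same number for early-stopping purposes.
    loss = torch.tensor([local_valid_loss], device=device)
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    return (loss / dist.get_world_size()).item()

def synchronized_stop(should_stop, device):
    # If any rank decides to stop, every rank stops; otherwise a stopped
    # worker leaves the other workers (and the parent) waiting forever.
    flag = torch.tensor([1 if should_stop else 0], device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())
```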

@BenjaminBossan
Collaborator

Thanks for reporting. Do you know how this is solved more generally (say, only using PyTorch without any frameworks)? I could imagine that similar errors can occur easily, given how tricky multi-threading is in general. Unfortunately, I don't have access to a setup to experiment with this.

@thomasjpfan
Member Author

I do not have a setup to experiment with this either. I've seen two solutions.

  1. During validation, move everything to one GPU and compute the loss/metrics there: https://github.com/Lance0218/Pytorch-DistributedDataParallel-Training-Tricks/blob/fa709835c7bf5e62f48c72b90eb12f3b795ef07d/DDP_warmup.py#L140-L151
  2. During validation, distribute the data to all GPUs and use a barrier to wait until validation is complete (a rough sketch of this pattern follows below): https://github.com/allenai/allennlp/blob/39c40fe38cd2fd36b3465b0b3c031f54ec824160/allennlp/training/trainer.py#L1022-L1025

BTW there are a bunch of barrier calls in this file to handle the distributed case.
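For reference, a rough sketch of option 2 could look like this (`model`, `valid_loader` and `criterion` are placeholders; an initialized process group is assumed): each rank validates only its own data shard and then waits at a barrier, similar to the barrier calls in the allennlp trainer.

```python
import torch
import torch.distributed as dist

def validate_rank_shard(model, valid_loader, criterion, device):
    # Each DDP rank evaluates only its own shard of the validation data.
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for X, y in valid_loader:
            X, y = X.to(device), y.to(device)
            total_loss += criterion(model(X), y).item()
            n_batches += 1
    # Wait until every rank has finished its validation pass before
    # training continues, as in the barrier calls linked above.
    dist.barrier()
    return total_loss / max(n_batches, 1)
```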

Since we do not have the resources to test DDP, I think it would be hard to officially support it.

@BenjaminBossan
Collaborator

I know too little to really comment on that. Ideally, I would wish for skorch to get out of the way enough that users can use DistributedDataParallel if they wish to. Regarding barriers, is that something that could be achieved through callbacks?
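Something along these lines, perhaps (untested; the class name and the choice of the `on_epoch_end` hook are assumptions):

```python
import torch.distributed as dist
from skorch.callbacks import Callback

class DistributedBarrier(Callback):
    # Hypothetical callback: synchronize all DDP ranks at the end of each epoch.
    def on_epoch_end(self, net, **kwargs):
        # Only meaningful when a process group has been initialized.
        if dist.is_available() and dist.is_initialized():
            dist.barrier()
```

It could then be passed to the net via `callbacks=[DistributedBarrier()]`, but again, this is untested.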
