-
When I tried distributed TensorFlow via gRPC a couple of years ago, it was very slow compared to Horovod, but this may have improved since, of course. How fast is your network link? How much slower is your 2x2 distributed training per step compared to a local training with 1x2 GPUs? Is the overhead larger than the expected transfer times over your network link?
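To compare per-step times directly, as suggested above, a minimal sketch along these lines should work; `StepTimer` is a hypothetical helper, assuming a TF 2.x Keras training loop:

```python
import time
import tensorflow as tf

class StepTimer(tf.keras.callbacks.Callback):
    """Hypothetical helper: print the wall-clock time of every training step."""

    def on_train_batch_begin(self, batch, logs=None):
        self._start = time.perf_counter()

    def on_train_batch_end(self, batch, logs=None):
        print(f"step {batch}: {time.perf_counter() - self._start:.3f}s")

# Usage: model.fit(dataset, epochs=1, callbacks=[StepTimer()])
# Run once under the 2x2 MultiWorkerMirroredStrategy and once locally with
# 1x2 GPUs, then compare the reported seconds per step.
```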
-
Sorry, I have some questions to ask...
I tried tf.distribute.experimental.MultiWorkerMirroredStrategy on one subnetwork with 2 machines, each with 2 GPU cards, to train an ALBERT text-classification model, but found that training is slow. The training communication goes over gRPC.
If I use Horovod for distributed training, will it be faster than tf.distribute.experimental.MultiWorkerMirroredStrategy in raw TensorFlow?
Does Horovod do more optimization than native TensorFlow? Thanks.
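Not the exact setup from this thread, but a minimal sketch of how such a 2-machine MultiWorkerMirroredStrategy job is typically wired up, assuming TF 2.x; the worker addresses and the placeholder model are made up. Switching the collective implementation from the default RING (which runs over gRPC) to NCCL is often worth trying before moving to Horovod:

```python
import json
import os
import tensorflow as tf

# Hypothetical worker addresses; replace with the real hosts on your subnetwork.
# TF_CONFIG must be set before the strategy is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second machine
})

# NCCL routes the GPU all-reduce over NCCL instead of the default RING
# implementation; gRPC is then mainly used for coordination.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

with strategy.scope():
    # Placeholder model; build and compile the ALBERT classifier here instead.
    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```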