-
When I tried distributed TensorFlow via gRPC a couple of years ago, it was very slow compared to Horovod, but this may have improved since, of course. How fast is your network link? How much slower is your 2x2 distributed training per step compared to a local training with 1x2 GPUs? Is the overhead larger than the expected transfer times over your network link?
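To compare per-step times directly, as suggested above, a minimal sketch along these lines should work; `StepTimer` is a hypothetical helper, assuming a TF 2.x Keras training loop:

```python
import time
import tensorflow as tf

class StepTimer(tf.keras.callbacks.Callback):
    """Hypothetical helper: print the wall-clock time of every training step."""

    def on_train_batch_begin(self, batch, logs=None):
        self._start = time.perf_counter()

    def on_train_batch_end(self, batch, logs=None):
        print(f"step {batch}: {time.perf_counter() - self._start:.3f}s")

# Usage: model.fit(dataset, epochs=1, callbacks=[StepTimer()])
# Run once under the 2x2 MultiWorkerMirroredStrategy and once locally with
# 1x2 GPUs, then compare the reported seconds per step.
```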
-
Sorry, I have some questions to ask...
I tried tf.distribute.experimental.MultiWorkerMirroredStrategy on one subnetwork with 2 machines, each with 2 GPU cards, to train an ALBERT text-classification model, but found that training is slow. The training communication goes over gRPC.
If I use Horovod for distributed training, will it be faster than tf.distribute.experimental.MultiWorkerMirroredStrategy in raw TensorFlow?
Does Horovod do more optimization than native TensorFlow? Thanks.
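Not the exact setup from this thread, but a minimal sketch of how such a 2-machine MultiWorkerMirroredStrategy job is typically wired up, assuming TF 2.x; the worker addresses and the placeholder model are made up. Switching the collective implementation from the default RING (which runs over gRPC) to NCCL is often worth trying before moving to Horovod:

```python
import json
import os
import tensorflow as tf

# Hypothetical worker addresses; replace with the real hosts on your subnetwork.
# TF_CONFIG must be set before the strategy is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second machine
})

# NCCL routes the GPU all-reduce over NCCL instead of the default RING
# implementation; gRPC is then mainly used for coordination.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

with strategy.scope():
    # Placeholder model; build and compile the ALBERT classifier here instead.
    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```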