Resolve TF saved model not portable issue with tf.keras.optimizers #4031
Checklist before submitting
Description
Fixes #4028
Root Cause: `self._allreduce_grad` is treated as a `tf.function` when we initialize the optimizer outside a `tf.function`. Since all concrete functions are saved into `saved_model.pb` (see code here), the `HorovodAllreduce` op gets saved into the graph, and that op may not be registered in other environments. With `tf.keras.optimizers.legacy` this path is never reached: when we call `model.fit`, `optimizer.minimize` calls `_compute_gradients`, which is not overridden by the `compute_gradients` function in `DistributedOptimizer`, so the distributed optimizer is not taking effect at all!

Resolution
We don't need to explicitly register `allreduce_grad` as a `tf.function`: in graph mode it will be traced inside the outer `tf.function` anyway, so the Horovod ops will not be saved explicitly.

Review process to land
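The mechanism behind the root cause and the fix can be sketched with a pure-Python analogy. Note that `fake_tf_function`, `SAVED_FUNCTIONS`, and `allreduce` below are illustrative stand-ins, not real TensorFlow or Horovod APIs: the stand-in decorator records every explicitly registered function, mimicking how SavedModel serializes each concrete `tf.function` (and any custom op inside it), while a plain Python method only gets traced as part of the outer training-step function that calls it.

```python
# Stand-in for the concrete functions serialized into saved_model.pb.
SAVED_FUNCTIONS = []

def fake_tf_function(fn):
    # Stand-in for tf.function: explicit registration makes fn serializable
    # on its own, which is what baked the HorovodAllreduce op into the graph.
    SAVED_FUNCTIONS.append(fn.__name__)
    return fn

def allreduce(grads):
    # Stand-in for the HorovodAllreduce op: average gradients of 2 workers.
    return [g / 2.0 for g in grads]

class BeforeFix:
    def __init__(self):
        # BEFORE: wrapping at __init__ time registers a standalone function,
        # so its ops end up saved with the model.
        self._allreduce_grads = fake_tf_function(allreduce)

class AfterFix:
    def __init__(self):
        # AFTER: a plain callable; it is only traced when the outer
        # tf.function (the training step) calls it, so no separate
        # function containing the Horovod op is saved.
        self._allreduce_grads = allreduce

BeforeFix()
print("allreduce" in SAVED_FUNCTIONS)  # True: op saved standalone
SAVED_FUNCTIONS.clear()
AfterFix()
print("allreduce" in SAVED_FUNCTIONS)  # False: nothing saved standalone
```

The same idea applies in real TensorFlow: dropping the explicit `tf.function` wrapper keeps the allreduce logic out of the saved concrete functions, so the exported model loads in environments without Horovod installed.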