If we enable expert parallelism, there are two optimizers: one for the dense parameters and one for the expert parameters. When we call optimizer.step(), each optimizer performs gradient-norm clipping over its own parameters only.
But if we do not enable expert parallelism, the gradients of all model parameters are normed together as a whole.
So my question is: the gradient-norm behavior is mathematically different depending on whether expert parallelism is enabled, as illustrated in the sketch below. Is this expected?
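For illustration, here is a minimal sketch (hypothetical gradient values, not the library's actual clipping code) of why clipping each parameter group separately gives a different result from clipping all gradients with one global norm:

```python
import torch

# Hypothetical gradients for the two parameter groups.
dense_grads  = [torch.tensor([3.0, 4.0])]   # norm = 5
expert_grads = [torch.tensor([6.0, 8.0])]   # norm = 10
max_norm = 5.0

def clip(grads, max_norm):
    # Scale all grads so their combined L2 norm is at most max_norm.
    total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    scale = torch.clamp(max_norm / (total_norm + 1e-6), max=1.0)
    return [g * scale for g in grads]

# Expert parallelism ON: each optimizer clips its own group.
separate = clip(dense_grads, max_norm) + clip(expert_grads, max_norm)

# Expert parallelism OFF: one optimizer clips over all parameters.
joint = clip(dense_grads + expert_grads, max_norm)

print(separate)  # dense grads kept as-is (norm 5), expert grads scaled by 5/10
print(joint)     # combined norm is sqrt(125) ≈ 11.18, so all grads scaled by 5/11.18
```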