If we enable expert parallelism, there are two optimizers: one for the dense parameters and one for the expert parameters. When we call optimizer.step(), each optimizer performs gradient-norm clipping over its own parameters only.
But if we do not enable expert parallelism, the gradients of all model parameters are normed together as a whole.
So my question is: the gradient-norm behavior is mathematically different depending on whether expert parallelism is enabled, as illustrated in the sketch below. Is this expected?
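For illustration, here is a minimal sketch (hypothetical gradient values, not the library's actual clipping code) of why clipping each parameter group separately gives a different result from clipping all gradients with one global norm:

```python
import torch

# Hypothetical gradients for the two parameter groups.
dense_grads  = [torch.tensor([3.0, 4.0])]   # norm = 5
expert_grads = [torch.tensor([6.0, 8.0])]   # norm = 10
max_norm = 5.0

def clip(grads, max_norm):
    # Scale all grads so their combined L2 norm is at most max_norm.
    total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    scale = torch.clamp(max_norm / (total_norm + 1e-6), max=1.0)
    return [g * scale for g in grads]

# Expert parallelism ON: each optimizer clips its own group.
separate = clip(dense_grads, max_norm) + clip(expert_grads, max_norm)

# Expert parallelism OFF: one optimizer clips over all parameters.
joint = clip(dense_grads + expert_grads, max_norm)

print(separate)  # dense grads kept as-is (norm 5), expert grads scaled by 5/10
print(joint)     # combined norm is sqrt(125) ≈ 11.18, so all grads scaled by 5/11.18
```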