
[BUG] ConstantGradScaler and loss-scale argument do not match #776

Open
BeingGod opened this issue Apr 12, 2024 · 0 comments


BeingGod commented Apr 12, 2024

Describe the bug
The usage and the description of loss-scale are inconsistent. The --loss-scale argument is described as taking a positive power of 2, but ConstantGradScaler uses the supplied value as the real scale directly rather than computing 2**loss_scale.

Argument Description:
[screenshot: help text of the --loss-scale argument]

Argument Usage:
[screenshot: ConstantGradScaler constructed with args.loss_scale]
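
The two screenshots do not survive in this copy. As a rough, self-contained reconstruction of the mismatch they showed: the sketch below models the argument definition and the scaler on Megatron-LM, but the exact help wording and the ConstantGradScaler body are assumptions, not verbatim code.

import argparse

# Argument description: the help text asks for a "power of 2", which can be
# read either as the scale itself (e.g. 4096) or as an exponent (e.g. 12).
parser = argparse.ArgumentParser()
parser.add_argument('--loss-scale', type=float, default=None,
                    help='Static loss scaling, a positive power of 2 '
                         'can improve fp16 convergence.')
args = parser.parse_args(['--loss-scale', '12'])

# Argument usage: the value is handed to the scaler unchanged, so it is
# treated as the real scale; nothing ever computes 2**loss_scale.
class ConstantGradScaler:  # simplified stand-in for Megatron-LM's class
    def __init__(self, initial_scale):
        assert initial_scale > 0.0
        self.scale = initial_scale  # used directly, no exponentiation

grad_scaler = ConstantGradScaler(args.loss_scale)
print(grad_scaler.scale)  # 12.0, not 2**12 == 4096.0

Under the "exponent" reading of the help text, a user passing 12 would expect a scale of 4096, while the code path above uses 12 itself; that gap is the inconsistency being reported.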

To Reproduce
Steps to reproduce the behavior. The easier it is to reproduce, the faster it will get the maintainers' attention.

Expected behavior
A clear and concise description of what you expected to happen.

Stack trace/logs
If applicable, add the stack trace or logs from the time of the error.

Environment (please complete the following information):

  • Megatron-LM commit ID
  • PyTorch version
  • CUDA version
  • NCCL version

Proposed fix
If you have a proposal for how to fix the issue, state it here or link to a PR.

Additional context
Add any other context about the problem here.
