Describe the bug
When there is a mismatch between the dtype settings of the model and ds_config, training starts without any error and the loss becomes NaN (this mainly happens with ZeRO stage 0).
I suggest adding a dtype check between the model and the config during deepspeed.initialize and raising an assertion error if they do not match. What do you think?
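To make the suggestion concrete, here is a rough sketch of the kind of check I mean. The helper name, where it would be called, and the exact config fields it reads are assumptions for illustration, not existing DeepSpeed code:

import torch

def assert_model_matches_ds_config(model, ds_config):
    # Hypothetical helper: derive the parameter dtype that the config implies
    # and compare it with the model's actual parameters before engine setup.
    if ds_config.get("bf16", {}).get("enabled", False):
        expected = torch.bfloat16
    elif ds_config.get("fp16", {}).get("enabled", False):
        expected = torch.float16
    else:
        expected = torch.float32

    for name, param in model.named_parameters():
        assert param.dtype == expected, (
            f"Parameter '{name}' is {param.dtype}, but ds_config implies {expected}; "
            "cast the model or adjust fp16/bf16 in ds_config before deepspeed.initialize."
        )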
To Reproduce
+ net = net.half()  # added: cast the model to half precision, creating the mismatch with ds_config
model_engine, optimizer, trainloader, __ = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=parameters,
    training_data=trainset,
    config=ds_config,
)
# Get the local device name (str) and local rank (int).
local_device = get_accelerator().device_name(model_engine.local_rank)
local_rank = model_engine.local_rank

# For float32, target_dtype will be None so no datatype conversion is needed.
target_dtype = None
if model_engine.bfloat16_enabled():
    target_dtype = torch.bfloat16
elif model_engine.fp16_enabled():
    target_dtype = torch.half
+ target_dtype = torch.half  # added: force the inputs to fp16 even though ds_config does not enable fp16
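For reference, the mismatch comes from a config that leaves fp16 disabled while the model above was cast to half. The config below is only an illustrative example of such a setup, not the exact one used:

# Illustrative ds_config (placeholder values): fp16 stays disabled while the
# model was cast with net.half(), so the engine treats everything as fp32 and
# the loss silently turns NaN under ZeRO stage 0.
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": False},
    "zero_optimization": {"stage": 0},
}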