Describe the bug
When there is a mismatch between the dtype settings of the model and ds_config, training starts without any error and the loss becomes NaN (this mainly happens with ZeRO stage 0).
I suggest adding a dtype check between the model and the config during deepspeed.initialize and raising an assertion error if they do not match. What do you think?
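To make the suggestion concrete, here is a rough sketch of the kind of check I mean. The helper name, where it would be called, and the exact config fields it reads are assumptions for illustration, not existing DeepSpeed code:

import torch

def assert_model_matches_ds_config(model, ds_config):
    # Hypothetical helper: derive the parameter dtype that the config implies
    # and compare it with the model's actual parameters before engine setup.
    if ds_config.get("bf16", {}).get("enabled", False):
        expected = torch.bfloat16
    elif ds_config.get("fp16", {}).get("enabled", False):
        expected = torch.float16
    else:
        expected = torch.float32

    for name, param in model.named_parameters():
        assert param.dtype == expected, (
            f"Parameter '{name}' is {param.dtype}, but ds_config implies {expected}; "
            "cast the model or adjust fp16/bf16 in ds_config before deepspeed.initialize."
        )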
To Reproduce
+ net = net.half()  # added: cast the model to half precision, creating the mismatch with ds_config
model_engine, optimizer, trainloader, __ = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=parameters,
    training_data=trainset,
    config=ds_config,
)
# Get the local device name (str) and local rank (int).
local_device = get_accelerator().device_name(model_engine.local_rank)
local_rank = model_engine.local_rank

# For float32, target_dtype will be None so no datatype conversion is needed.
target_dtype = None
if model_engine.bfloat16_enabled():
    target_dtype = torch.bfloat16
elif model_engine.fp16_enabled():
    target_dtype = torch.half
+ target_dtype = torch.half  # added: force the inputs to fp16 even though ds_config does not enable fp16
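For reference, the mismatch comes from a config that leaves fp16 disabled while the model above was cast to half. The config below is only an illustrative example of such a setup, not the exact one used:

# Illustrative ds_config (placeholder values): fp16 stays disabled while the
# model was cast with net.half(), so the engine treats everything as fp32 and
# the loss silently turns NaN under ZeRO stage 0.
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": False},
    "zero_optimization": {"stage": 0},
}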