
train_second.py model.decoder error (output tensor is nan) #193

Open
junylee11 opened this issue Jan 22, 2024 · 4 comments

@junylee11 commented Jan 22, 2024

The g_loss value in train_second.py is NaN.
Debugging shows that the output of the model.decoder() call is NaN (lines 391 and 402).
There was no problem in train_first.py, and I don't know why this happens only in train_second.py.

Any help fixing this error would be appreciated.
Thank you.


log_dir: "C:\Users\user_\Desktop\styleTTS2_test_data"
first_stage_path: "first_stage.pth"
save_freq: 2
log_interval: 10
device: "cuda"
epochs_1st: 200 # number of epochs for first stage training (pre-training)
epochs_2nd: 100 # number of epochs for second stage training (joint training)
batch_size: 4
max_len: 200 # maximum number of frames
pretrained_model: "C:\Users\user_\Desktop\styleTTS2_test_data\epoch_1st_00170.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if you do not want to load epoch numbers and optimizer parameters

# config for decoder

decoder:
  type: 'istftnet' # either hifigan or istftnet
  resblock_kernel_sizes: [3,7,11]
  upsample_rates: [10, 6]
  upsample_initial_channel: 512
  resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
  upsample_kernel_sizes: [20, 12]
  gen_istft_n_fft: 20
  gen_istft_hop_size: 5
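
Not a fix, but one way to narrow the search: a minimal NaN-guard sketch around the decoder call in train_second.py. The assert_finite helper is hypothetical, and the decoder argument order in the commented usage is an assumption based on the variable names discussed in this thread; torch.autograd.set_detect_anomaly(True) can additionally point at the op that first produces NaN during the backward pass.

```python
import torch

def assert_finite(name: str, t: torch.Tensor) -> None:
    """Hypothetical debugging helper: fail fast if a tensor contains NaN or Inf."""
    if not torch.isfinite(t).all():
        raise RuntimeError(
            f"{name} is non-finite "
            f"(nan={torch.isnan(t).any().item()}, inf={torch.isinf(t).any().item()})"
        )

# Quick self-test with dummy tensors:
assert_finite("ok", torch.randn(2, 3))

# Sketch of where it would go in train_second.py (variable names as posted in this issue):
#   for n, t in (("en", en), ("F0_fake", F0_fake), ("N_fake", N_fake), ("s", s)):
#       assert_finite(n, t)
#   y_rec = model.decoder(en, F0_fake, N_fake, s)
#   assert_finite("y_rec", y_rec)
```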

@akshatgarg99

Same issue

@effusiveperiscope

I have experienced this before in a few situations:

  • the actual model parameters are not being loaded from the checkpoint (there is a naming mismatch involving the "module" prefix between stages 1 and 2 and between distributed and non-distributed training; try changing strict loading to true and see what it reports about the keys, as in the sketch after this list)
  • multispeaker is set incorrectly
  • certain batch sizes with mixed precision (try changing the batch size)
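
For the first bullet, a minimal sketch of what that check could look like, assuming a checkpoint laid out as a dict of per-component state dicts (the "net"/"decoder" keys and the load_component helper are illustrative, not the repo's actual loader):

```python
import torch
import torch.nn as nn

def load_component(module: nn.Module, state: dict) -> None:
    """Strip a DistributedDataParallel "module." prefix and load with strict key checking."""
    state = {k.removeprefix("module."): v for k, v in state.items()}
    # strict=True raises and lists the exact missing/unexpected keys instead of silently
    # leaving randomly initialized weights in place (one way the decoder ends up emitting NaN).
    module.load_state_dict(state, strict=True)

# Usage sketch (checkpoint layout is an assumption; adapt to how this repo saves checkpoints):
#   ckpt = torch.load("epoch_1st_00170.pth", map_location="cpu")
#   load_component(model.decoder, ckpt["net"]["decoder"])
```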

@yl4579 (Owner) commented Mar 7, 2024

Have you checked whether F0_fake, N_fake, s or en are all not NaN?

@suryasubbu

Have you checked whether F0_fake, N_fake, s or en are all not NaN?

None of the above are NaN.
The problem starts in the model.decoder() call, where y_rec_gt_pred becomes NaN even though its arguments are not NaN.
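
If the inputs are finite but the output is not, forward hooks can name the first submodule inside the decoder whose output goes non-finite (for example a conv or the iSTFT head under mixed precision). A sketch, assuming nothing beyond plain PyTorch:

```python
import torch
import torch.nn as nn

def install_nan_hooks(module: nn.Module):
    """Register forward hooks that raise at the first submodule producing NaN/Inf output."""
    def make_hook(name):
        def check(mod, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outs:
                if torch.is_tensor(o) and not torch.isfinite(o).all():
                    raise RuntimeError(f"non-finite output from {name} ({mod.__class__.__name__})")
        return check
    return [m.register_forward_hook(make_hook(n)) for n, m in module.named_modules()]

# Usage sketch: call handles = install_nan_hooks(model.decoder) before the failing forward pass,
# then h.remove() on each handle once the offending layer has been identified.
```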
