
train_second.py model.decoder error (output tensor is nan) #193

Open
junylee11 opened this issue Jan 22, 2024 · 4 comments

@junylee11 commented Jan 22, 2024

The g_loss value in train_second.py is NaN.
Debugging shows that the output of the model.decoder() call is NaN (lines 391 and 402).
There was no problem in train_first.py, and I don't know why this happens only in train_second.py.

Any help fixing this error would be appreciated.
Thank you.


log_dir: "C:\Users\user_\Desktop\styleTTS2_test_data"
first_stage_path: "first_stage.pth"
save_freq: 2
log_interval: 10
device: "cuda"
epochs_1st: 200 # number of epochs for first stage training (pre-training)
epochs_2nd: 100 # number of epochs for second stage training (joint training)
batch_size: 4
max_len: 200 # maximum number of frames
pretrained_model: "C:\Users\user_\Desktop\styleTTS2_test_data\epoch_1st_00170.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if you do not want to load epoch numbers and optimizer parameters

# config for decoder

decoder:
  type: 'istftnet' # either hifigan or istftnet
  resblock_kernel_sizes: [3,7,11]
  upsample_rates: [10, 6]
  upsample_initial_channel: 512
  resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
  upsample_kernel_sizes: [20, 12]
  gen_istft_n_fft: 20
  gen_istft_hop_size: 5
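
Not a fix, but one way to narrow the search: a minimal NaN-guard sketch around the decoder call in train_second.py. The assert_finite helper is hypothetical, and the decoder argument order in the commented usage is an assumption based on the variable names discussed in this thread; torch.autograd.set_detect_anomaly(True) can additionally point at the op that first produces NaN during the backward pass.

```python
import torch

def assert_finite(name: str, t: torch.Tensor) -> None:
    """Hypothetical debugging helper: fail fast if a tensor contains NaN or Inf."""
    if not torch.isfinite(t).all():
        raise RuntimeError(
            f"{name} is non-finite "
            f"(nan={torch.isnan(t).any().item()}, inf={torch.isinf(t).any().item()})"
        )

# Quick self-test with dummy tensors:
assert_finite("ok", torch.randn(2, 3))

# Sketch of where it would go in train_second.py (variable names as posted in this issue):
#   for n, t in (("en", en), ("F0_fake", F0_fake), ("N_fake", N_fake), ("s", s)):
#       assert_finite(n, t)
#   y_rec = model.decoder(en, F0_fake, N_fake, s)
#   assert_finite("y_rec", y_rec)
```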

@akshatgarg99

Same issue

@effusiveperiscope

I have experienced this before in a few situations:

  • the actual model parameters are not being loaded from the checkpoint (there is a naming mismatch involving the "module" prefix between stages 1 and 2 and between distributed and non-distributed training; try changing strict loading to true and see what it reports about the keys, as in the sketch after this list)
  • multispeaker is set incorrectly
  • certain batch sizes with mixed precision (try changing the batch size)
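
For the first bullet, a minimal sketch of what that check could look like, assuming a checkpoint laid out as a dict of per-component state dicts (the "net"/"decoder" keys and the load_component helper are illustrative, not the repo's actual loader):

```python
import torch
import torch.nn as nn

def load_component(module: nn.Module, state: dict) -> None:
    """Strip a DistributedDataParallel "module." prefix and load with strict key checking."""
    state = {k.removeprefix("module."): v for k, v in state.items()}
    # strict=True raises and lists the exact missing/unexpected keys instead of silently
    # leaving randomly initialized weights in place (one way the decoder ends up emitting NaN).
    module.load_state_dict(state, strict=True)

# Usage sketch (checkpoint layout is an assumption; adapt to how this repo saves checkpoints):
#   ckpt = torch.load("epoch_1st_00170.pth", map_location="cpu")
#   load_component(model.decoder, ckpt["net"]["decoder"])
```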

@yl4579 (Owner) commented Mar 7, 2024

Have you checked whether F0_fake, N_fake, s or en are all not NaN?

@suryasubbu

Have you checked whether F0_fake, N_fake, s or en are all not NaN?

None of the above are NaN.
The problem starts in the model.decoder() call, where y_rec_gt_pred becomes NaN even though its arguments are not NaN.
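
If the inputs are finite but the output is not, forward hooks can name the first submodule inside the decoder whose output goes non-finite (for example a conv or the iSTFT head under mixed precision). A sketch, assuming nothing beyond plain PyTorch:

```python
import torch
import torch.nn as nn

def install_nan_hooks(module: nn.Module):
    """Register forward hooks that raise at the first submodule producing NaN/Inf output."""
    def make_hook(name):
        def check(mod, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outs:
                if torch.is_tensor(o) and not torch.isfinite(o).all():
                    raise RuntimeError(f"non-finite output from {name} ({mod.__class__.__name__})")
        return check
    return [m.register_forward_hook(make_hook(n)) for n, m in module.named_modules()]

# Usage sketch: call handles = install_nan_hooks(model.decoder) before the failing forward pass,
# then h.remove() on each handle once the offending layer has been identified.
```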
