Training regression for Conformer-Transducer models #2533

Closed
asumagic opened this issue Apr 30, 2024 · 2 comments · Fixed by #2548

asumagic (Collaborator) commented Apr 30, 2024

Describe the bug

I noticed that training on the current develop commit fails to converge: the WER stays stuck at 100% and the loss plateaus around ~106 after a few epochs. I am not sure whether this is specific to the Conformer-Transducer.

I have started bisecting the issue (as sketched below): 7c63cf2c436805ad368292666a7bee1debd9fa46 is good, 0a91bc09fbafd32e25bd299534178da9a0d23a6d is bad. Both runs use the same environment and the default config.
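
For reference, this kind of narrowing-down can be done with a standard git bisect session. A minimal sketch, assuming the two commits above are reachable from develop and that "good" is judged by whether the train loss drops below the ~106 plateau after a couple of epochs:

git bisect start
git bisect bad 0a91bc09fbafd32e25bd299534178da9a0d23a6d
git bisect good 7c63cf2c436805ad368292666a7bee1debd9fa46
# At each commit git checks out, train for a few epochs with the command
# from "To Reproduce", inspect the loss, then mark the commit:
git bisect good    # or: git bisect bad
# Repeat until git reports the first bad commit, then clean up:
git bisect reset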

Expected behaviour

The model should converge.

To Reproduce

python3 train.py hparams/conformer_transducer.yaml --data_folder /corpus/LibriSpeech/ --precision=fp16

Environment Details

No response

Relevant Log Output

Bad training:

epoch: 1, lr: 1.95e-04, steps: 6106, optimizer: AdamW - train loss: 1.26e+02 - valid loss: 9.05e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 2, lr: 3.91e-04, steps: 12212, optimizer: AdamW - train loss: 1.06e+02 - valid loss: 8.03e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 3, lr: 5.86e-04, steps: 18318, optimizer: AdamW - train loss: 1.06e+02 - valid loss: 7.45e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 4, lr: 7.82e-04, steps: 24424, optimizer: AdamW - train loss: 1.06e+02 - valid loss: 7.20e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 5, lr: 7.24e-04, steps: 30530, optimizer: AdamW - train loss: 1.06e+02 - valid loss: 7.58e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 6, lr: 6.61e-04, steps: 36636, optimizer: AdamW - train loss: 1.06e+02 - valid loss: 7.69e-01, valid CER: 1.00e+02, valid WER: 1.00e+02

Good training:

epoch: 1, lr: 1.95e-04, steps: 6106, optimizer: AdamW - train loss: 1.26e+02 - valid loss: 8.81e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 2, lr: 3.91e-04, steps: 12212, optimizer: AdamW - train loss: 81.68 - valid loss: 1.61e-01, valid CER: 21.03, valid WER: 36.21
epoch: 3, lr: 5.86e-04, steps: 18318, optimizer: AdamW - train loss: 51.32 - valid loss: 8.18e-02, valid CER: 9.02, valid WER: 19.53
epoch: 4, lr: 7.82e-04, steps: 24424, optimizer: AdamW - train loss: 43.26 - valid loss: 6.21e-02, valid CER: 6.48, valid WER: 14.99
epoch: 5, lr: 7.24e-04, steps: 30530, optimizer: AdamW - train loss: 38.38 - valid loss: 4.74e-02, valid CER: 4.68, valid WER: 11.33
epoch: 6, lr: 6.61e-04, steps: 36636, optimizer: AdamW - train loss: 34.17 - valid loss: 4.04e-02, valid CER: 3.83, valid WER: 9.69
epoch: 7, lr: 6.12e-04, steps: 42742, optimizer: AdamW - train loss: 31.40 - valid loss: 3.62e-02, valid CER: 3.47, valid WER: 8.68

Additional Context

No response

asumagic added the bug label on Apr 30, 2024
asumagic self-assigned this on Apr 30, 2024
asumagic added this to the v1.0.1 milestone on Apr 30, 2024
TParcollet (Collaborator) commented:

I believe I've seen a few changes to RelPos and InputNorm go through; could this be related?

asumagic (Collaborator, Author) commented:

> I believe I've seen a few changes to RelPos and InputNorm go through; could this be related?

Maybe, though inference was fine. The InputNorm change doesn't even affect the mode used by the model, and I couldn't find an issue in the RelPos change... We'll see what the bisecting says; it should be done by today.
