Training regression for Conformer-Transducer models #2533

Closed
asumagic opened this issue Apr 30, 2024 · 2 comments · Fixed by #2548

asumagic (Collaborator) commented Apr 30, 2024

Describe the bug

I noticed that training on the current develop commit fails to converge: the WER stays stuck at 100% and the loss plateaus around ~106 after a few epochs. I am not sure whether this is specific to the Conformer-Transducer.

I have started bisecting the issue (as sketched below): 7c63cf2c436805ad368292666a7bee1debd9fa46 is good, 0a91bc09fbafd32e25bd299534178da9a0d23a6d is bad. Both runs use the same environment and the default config.
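
For reference, this kind of narrowing-down can be done with a standard git bisect session. A minimal sketch, assuming the two commits above are reachable from develop and that "good" is judged by whether the train loss drops below the ~106 plateau after a couple of epochs:

git bisect start
git bisect bad 0a91bc09fbafd32e25bd299534178da9a0d23a6d
git bisect good 7c63cf2c436805ad368292666a7bee1debd9fa46
# At each commit git checks out, train for a few epochs with the command
# from "To Reproduce", inspect the loss, then mark the commit:
git bisect good    # or: git bisect bad
# Repeat until git reports the first bad commit, then clean up:
git bisect reset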

Expected behaviour

The model should converge.

To Reproduce

python3 train.py hparams/conformer_transducer.yaml --data_folder /corpus/LibriSpeech/ --precision=fp16

Environment Details

No response

Relevant Log Output

Bad training:

epoch: 1, lr: 1.95e-04, steps: 6106, optimizer: AdamW - train loss: 1.26e+02 - valid loss: 9.05e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 2, lr: 3.91e-04, steps: 12212, optimizer: AdamW - train loss: 1.06e+02 - valid loss: 8.03e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 3, lr: 5.86e-04, steps: 18318, optimizer: AdamW - train loss: 1.06e+02 - valid loss: 7.45e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 4, lr: 7.82e-04, steps: 24424, optimizer: AdamW - train loss: 1.06e+02 - valid loss: 7.20e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 5, lr: 7.24e-04, steps: 30530, optimizer: AdamW - train loss: 1.06e+02 - valid loss: 7.58e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 6, lr: 6.61e-04, steps: 36636, optimizer: AdamW - train loss: 1.06e+02 - valid loss: 7.69e-01, valid CER: 1.00e+02, valid WER: 1.00e+02

Good training:

epoch: 1, lr: 1.95e-04, steps: 6106, optimizer: AdamW - train loss: 1.26e+02 - valid loss: 8.81e-01, valid CER: 1.00e+02, valid WER: 1.00e+02
epoch: 2, lr: 3.91e-04, steps: 12212, optimizer: AdamW - train loss: 81.68 - valid loss: 1.61e-01, valid CER: 21.03, valid WER: 36.21
epoch: 3, lr: 5.86e-04, steps: 18318, optimizer: AdamW - train loss: 51.32 - valid loss: 8.18e-02, valid CER: 9.02, valid WER: 19.53
epoch: 4, lr: 7.82e-04, steps: 24424, optimizer: AdamW - train loss: 43.26 - valid loss: 6.21e-02, valid CER: 6.48, valid WER: 14.99
epoch: 5, lr: 7.24e-04, steps: 30530, optimizer: AdamW - train loss: 38.38 - valid loss: 4.74e-02, valid CER: 4.68, valid WER: 11.33
epoch: 6, lr: 6.61e-04, steps: 36636, optimizer: AdamW - train loss: 34.17 - valid loss: 4.04e-02, valid CER: 3.83, valid WER: 9.69
epoch: 7, lr: 6.12e-04, steps: 42742, optimizer: AdamW - train loss: 31.40 - valid loss: 3.62e-02, valid CER: 3.47, valid WER: 8.68

Additional Context

No response

asumagic added the bug label on Apr 30, 2024
asumagic self-assigned this on Apr 30, 2024
asumagic added this to the v1.0.1 milestone on Apr 30, 2024
TParcollet (Collaborator) commented:

I believe I've seen a few changes to RelPos and InputNorm go through; could this be related?

asumagic (Collaborator, Author) commented:

> I believe I've seen a few changes to RelPos and InputNorm go through; could this be related?

Maybe, though inference was fine. The InputNorm change doesn't even affect the mode used by the model, and I couldn't find an issue in the RelPos change... We'll see what the bisecting says; it should be done by today.
