checkpoints not saved due to wrong loss comparison? #9168

Open
riqiang-dp opened this issue May 11, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug

I'm using val_loss as the criterion for comparing checkpoints and saving the top k. However, unlike WER, there seems to be some miscalculation happening: during training I see the message 'val_loss' was not in top {k}, but when I check the checkpoints directory, the latest model's val_loss is clearly within the top k. An example is given in the image below (the files are sorted by name, and this latest checkpoint appears between two other saved checkpoints, which indicates its loss is at least better than that of the kth checkpoint):
[Image: checkpoint directory listing, sorted by filename, showing the latest checkpoint's val_loss falling between two other saved top-k checkpoints]
I first saw this about a year ago and didn't think much of it; I checked the PyTorch Lightning code and didn't find anything suspicious there. Monitoring WER seems to work fine. Since I've now run into this bug again, I find it quite strange and don't know where else to start debugging.
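For reference, here is a minimal sketch (not the actual Lightning source, just the top-k logic as I understand it for mode: min) of how a ModelCheckpoint-style decision typically works; the helper name is_in_top_k and the example values are made up for illustration:

```python
import torch

def is_in_top_k(current: torch.Tensor, best_k_models: dict, k: int) -> bool:
    """Return True if `current` should displace the worst of the kept checkpoints."""
    if len(best_k_models) < k:
        # Fewer than k checkpoints saved so far: always keep the new one.
        return True
    # When minimizing, the "kth best" is the largest val_loss currently kept.
    kth_value = max(best_k_models.values())
    return bool(torch.lt(current, kth_value))

kept = {
    "epoch3.ckpt": torch.tensor(0.52),
    "epoch5.ckpt": torch.tensor(0.48),
    "epoch7.ckpt": torch.tensor(0.55),
}
print(is_in_top_k(torch.tensor(0.50), kept, k=3))  # True: 0.50 < 0.55
```

If this is the logic being applied, the comparison itself is simple, so the problem may lie in which value actually gets passed in as the monitored val_loss.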

Steps/Code to reproduce bug

exp_manager:
  exp_dir: null
  name: ${name}
  version: trial_1
  create_tensorboard_logger: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: "val_loss"
    mode: "min"
    save_top_k: 15
    always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints

  resume_if_exists: true
  resume_ignore_no_checkpoint: true
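
To narrow down whether this comes from NeMo's exp_manager wrapping or from Lightning's ModelCheckpoint itself, a plain-Lightning equivalent of the settings above could be tested. This is only a sketch; the filename pattern, model, and datamodule are placeholders, not part of my setup:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Same monitor/mode/save_top_k as checkpoint_callback_params above,
# but without exp_manager / always_save_nemo in the loop.
checkpoint_cb = ModelCheckpoint(
    monitor="val_loss",
    mode="min",
    save_top_k=15,
    filename="{epoch}-{step}-{val_loss:.4f}",  # placeholder naming pattern
)

trainer = Trainer(callbacks=[checkpoint_cb])  # plus the usual trainer arguments
# trainer.fit(model, datamodule=dm)           # placeholder model / datamodule
```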

Expected behavior

The model shown in the screenshot should be saved among the top k checkpoints.

Environment overview (please complete the following information)

  • Environment location: GCP
  • Method of NeMo install: from source

Environment details

  • PyTorch version: 2.1
  • Python version: 3.11

Additional context

@riqiang-dp riqiang-dp added the bug Something isn't working label May 11, 2024