
My loss jumps and then decreases #12718

Open
1 task done
xuxiaolin-github opened this issue May 16, 2024 · 6 comments
Labels
question Further information is requested

Comments

@xuxiaolin-github

Search before asking

Question

I trained a best.pt on my car & person dataset, then used this best.pt as the pretrained model to train on another, similar car & person dataset, but the loss grows. After jumping for about 2 epochs, the loss decreases slowly and cannot get back down to the first epoch's loss within 50 epochs (training stops after 50 epochs with no improvement).

My batch size is 4. I tried changing lr in default.yaml to 0.0025, but got the optimizer warning: optimizer: 'optimizer=auto' found, ignoring 'lr0=0.0025...

I want to know how to train so that the loss does not grow but goes down from the very beginning.

Additional

Is the learning rate the reason, because my batch size is 4?

xuxiaolin-github added the question (Further information is requested) label May 16, 2024

👋 Hello @xuxiaolin-github, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

It sounds like you're experiencing instability in your loss during training, which often indicates issues with the learning rate or batch size settings.

Here are a couple of suggestions:

  1. Adjust Learning Rate: Since you are using a smaller batch size, consider lowering the learning rate further than the one you tried. For a batch size of 4, a much smaller learning rate might stabilize training.

  2. Gradual Warmup: Implement a learning rate warmup strategy where the learning rate gradually increases from a lower value to the intended one over several epochs. This can help stabilize the training in the initial phases.

Example of setting learning rate warmup in Python:

from ultralytics import YOLO

# Load your model
model = YOLO('path/to/best.pt')

# Train with custom learning rate and warmup
results = model.train(data='your_dataset.yaml', lr0=0.001, epochs=50, warmup_epochs=5)
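
Note that with the default optimizer='auto' setting, a user-supplied lr0 is ignored (that is exactly the warning quoted in the question), so a variant of the call that pins the optimizer explicitly may be worth trying. This is only a sketch; the dataset path and values below are placeholders rather than settings verified in this thread:

from ultralytics import YOLO

# Sketch: force SGD so the custom lr0/warmup below are not replaced by the
# 'auto' optimizer heuristics; path and values are placeholders.
model = YOLO('path/to/best.pt')
results = model.train(
    data='your_dataset.yaml',  # placeholder dataset config
    optimizer='SGD',           # explicit optimizer so lr0 is respected
    lr0=0.001,
    warmup_epochs=5,
    batch=4,                   # matches the small batch size described above
    epochs=50,
)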

Lastly, ensure your dataset is correctly annotated and normalized, as issues there can also cause unstable loss.

Let us know how it goes after trying these changes!

@xuxiaolin-github
Author

Thanks, it works when I set optimizer=SGD and lr0=0.001 with batch size 4.
I used the pretrained yolov8s.pt to train on BDD plus my car & person dataset (at least 40,000 images, same labels as BDD): 230 epochs, batch 64 on an NVIDIA V100; mAP is 60.5 and the loss is normal.
I then used this best.pt to train on a mini car & person dataset (about 20,000 images) on my personal computer, with batch set to 4.

Sorry, I can't upload the training loss picture because of my company's network, so I will draw it this way.
Before, the loss looked like this (train and val); the first epoch was the best:

        1
       1    1
      1          1
      1              1
     1                   1
    1                        1
   1
  1

Now with lr=0.001, the loss becomes this (train and val); epoch 10 is the best:

      1
     1 1
    1   1
    1    1                           1      1
    1    1                   1                        1
   1      1           1                                        1
   1       1    1
              1

And I want to know why the loss is still unstable.

I remember I also changed some other code when training the mini dataset: because I want to prune the network through the BN gamma, I disabled AMP and added this code in the trainer:

import torch
import torch.nn as nn

# L1 sparsity push on the BatchNorm gamma/beta gradients, decayed linearly over training
l1_lambda = 1e-2 * (1 - 0.9 * epoch / self.epochs)
for k, m in self.model.named_modules():
    if isinstance(m, nn.BatchNorm2d):
        m.weight.grad.data.add_(l1_lambda * torch.sign(m.weight.data))  # gamma (scale)
        m.bias.grad.data.add_(1e-2 * torch.sign(m.bias.data))           # beta (shift)

Could this be the reason for the loss problem?

I will delete the code and try again to see what happens to the loss.

@glenn-jocher
Member

@xuxiaolin-github it sounds like you're making good progress with your adjustments! Switching to SGD and reducing the learning rate to 0.001 for a smaller batch size seems to have helped stabilize your training to some extent. 🚀

Regarding the instability in loss you're still experiencing, the additional code you added for pruning through BN gamma could indeed be influencing the training dynamics. Modifying gradients directly during training with a regularization term like the one you've added can introduce significant variability in the loss, particularly if the lambda value isn't carefully tuned relative to your learning rate and dataset size.

Removing or adjusting the pruning code is a good next step to see if it stabilizes the loss. Keep an eye on how the loss trends without these modifications and adjust the regularization strength if you decide to reintroduce it. Good luck, and let us know how it goes!
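
If you do reintroduce the sparsity penalty later, one possible direction, purely as a sketch (the 1e-4 base value is an assumption for illustration, not something tested in this thread), is to start from a much weaker lambda and apply it only to the BN scale (gamma), leaving the bias untouched:

# Hypothetical, gentler variant of the snippet above; values are illustrative only.
l1_lambda = 1e-4 * (1 - 0.9 * epoch / self.epochs)  # much weaker, still decayed
for _, m in self.model.named_modules():
    if isinstance(m, nn.BatchNorm2d):
        # push only gamma toward zero; skip beta to keep extra gradient noise small
        m.weight.grad.data.add_(l1_lambda * torch.sign(m.weight.data))

Whether this tames the oscillation would still need to be checked against the loss curves you sketched above.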

@xuxiaolin-github
Author

OK, thanks. I get the point; I will delete the code and train again.

@glenn-jocher
Member

Great decision! Removing the pruning code should help clarify if that's impacting your loss stability. Keep us posted on how the training progresses after making this change. If you encounter any further issues or have questions, feel free to reach out. Happy training! 🚀
