
My loss jumps and then decreases #12718

Open
1 task done
xuxiaolin-github opened this issue May 16, 2024 · 6 comments
Labels
question Further information is requested

Comments

@xuxiaolin-github

Search before asking

Question

I trained a best.pt on my car & person dataset, then used this best.pt as the pretrained model to train on another, similar car & person dataset, but the loss grows. After jumping for about 2 epochs, the loss decreases slowly and cannot get back down to the first epoch's loss within 50 epochs (training stops after 50 epochs with no improvement).

My batch size is 4. I tried changing lr in default.yaml to 0.0025, but got the optimizer warning: optimizer: 'optimizer=auto' found, ignoring 'lr0=0.0025...

I want to know how to train so that the loss does not grow but goes down from the very beginning.

Additional

Is the learning rate the reason, because my batch size is 4?

xuxiaolin-github added the question (Further information is requested) label May 16, 2024

👋 Hello @xuxiaolin-github, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

It sounds like you're experiencing instability in your loss during training, which often indicates issues with the learning rate or batch size settings.

Here are a couple of suggestions:

  1. Adjust Learning Rate: Since you are using a smaller batch size, consider lowering the learning rate further than the one you tried. For a batch size of 4, a much smaller learning rate might stabilize training.

  2. Gradual Warmup: Implement a learning rate warmup strategy where the learning rate gradually increases from a lower value to the intended one over several epochs. This can help stabilize the training in the initial phases.

Example of setting learning rate warmup in Python:

from ultralytics import YOLO

# Load your model
model = YOLO('path/to/best.pt')

# Train with custom learning rate and warmup
results = model.train(data='your_dataset.yaml', lr0=0.001, epochs=50, warmup_epochs=5)
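
Note that with the default optimizer='auto' setting, a user-supplied lr0 is ignored (that is exactly the warning quoted in the question), so a variant of the call that pins the optimizer explicitly may be worth trying. This is only a sketch; the dataset path and values below are placeholders rather than settings verified in this thread:

from ultralytics import YOLO

# Sketch: force SGD so the custom lr0/warmup below are not replaced by the
# 'auto' optimizer heuristics; path and values are placeholders.
model = YOLO('path/to/best.pt')
results = model.train(
    data='your_dataset.yaml',  # placeholder dataset config
    optimizer='SGD',           # explicit optimizer so lr0 is respected
    lr0=0.001,
    warmup_epochs=5,
    batch=4,                   # matches the small batch size described above
    epochs=50,
)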

Lastly, ensure your dataset is correctly annotated and normalized, as issues there can also cause unstable loss.

Let us know how it goes after trying these changes!

@xuxiaolin-github
Author

Thanks, it works when I set optimizer=SGD and lr0=0.001 with batch size 4.
I used the pretrained yolov8s.pt to train on BDD plus my car & person dataset (at least 40,000 images, same labels as BDD): 230 epochs, batch 64 on an NVIDIA V100; mAP is 60.5 and the loss is normal.
I then used this best.pt to train on a mini car & person dataset (about 20,000 images) on my personal computer, with batch set to 4.

Sorry, I can't upload the training loss picture because of my company's network, so I will draw it this way.
Before, the loss looked like this (train and val); the first epoch was the best:

        1
       1    1
      1          1
      1              1
     1                   1
    1                        1
   1
  1

Now with lr=0.001, the loss becomes this (train and val); epoch 10 is the best:

      1
     1 1
    1   1
    1    1                           1      1
    1    1                   1                        1
   1      1           1                                        1
   1       1    1
              1

And I want to know why the loss is still unstable.

I remember I also changed some other code when training the mini dataset: because I want to prune the network through the BN gamma, I disabled AMP and added this code in the trainer:

import torch
import torch.nn as nn

# L1 sparsity push on the BatchNorm gamma/beta gradients, decayed linearly over training
l1_lambda = 1e-2 * (1 - 0.9 * epoch / self.epochs)
for k, m in self.model.named_modules():
    if isinstance(m, nn.BatchNorm2d):
        m.weight.grad.data.add_(l1_lambda * torch.sign(m.weight.data))  # gamma (scale)
        m.bias.grad.data.add_(1e-2 * torch.sign(m.bias.data))           # beta (shift)

Could this be the reason for the loss problem?

I will delete the code and try again to see what happens to the loss.

@glenn-jocher
Member

@xuxiaolin-github it sounds like you're making good progress with your adjustments! Switching to SGD and reducing the learning rate to 0.001 for a smaller batch size seems to have helped stabilize your training to some extent. 🚀

Regarding the instability in loss you're still experiencing, the additional code you added for pruning through BN gamma could indeed be influencing the training dynamics. Modifying gradients directly during training with a regularization term like the one you've added can introduce significant variability in the loss, particularly if the lambda value isn't carefully tuned relative to your learning rate and dataset size.

Removing or adjusting the pruning code is a good next step to see if it stabilizes the loss. Keep an eye on how the loss trends without these modifications and adjust the regularization strength if you decide to reintroduce it. Good luck, and let us know how it goes!
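
If you do reintroduce the sparsity penalty later, one possible direction, purely as a sketch (the 1e-4 base value is an assumption for illustration, not something tested in this thread), is to start from a much weaker lambda and apply it only to the BN scale (gamma), leaving the bias untouched:

# Hypothetical, gentler variant of the snippet above; values are illustrative only.
l1_lambda = 1e-4 * (1 - 0.9 * epoch / self.epochs)  # much weaker, still decayed
for _, m in self.model.named_modules():
    if isinstance(m, nn.BatchNorm2d):
        # push only gamma toward zero; skip beta to keep extra gradient noise small
        m.weight.grad.data.add_(l1_lambda * torch.sign(m.weight.data))

Whether this tames the oscillation would still need to be checked against the loss curves you sketched above.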

@xuxiaolin-github
Author

OK, thanks. I get the point; I will delete the code and train again.

@glenn-jocher
Member

Great decision! Removing the pruning code should help clarify if that's impacting your loss stability. Keep us posted on how the training progresses after making this change. If you encounter any further issues or have questions, feel free to reach out. Happy training! 🚀
