
Performance improvement to improve training time #11718

Open · wants to merge 54 commits into main

Conversation

@edkazcarlson commented May 7, 2024

In this PR I clean up some existing code to help improve training times.
The main speedup comes from not running transforms that are configured with a 0% chance of applying, but I also introduced JIT-compiled methods with the numba library to handle some operations that are run frequently.
If the team doesn't want to use numba (code clutter, licensing, etc.) I'll remove it; while the JIT-compiled methods did introduce some speedup, the majority came from the changes to the image transformations.
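As an illustration, here is a minimal sketch of the kind of numba JIT-compiled helper this refers to; the function name, box format, and shapes are assumptions for the example, not the exact code in this PR.

import numpy as np
from numba import njit

@njit(cache=True)
def flip_boxes_lr(boxes, img_width):
    # Mirror (N, 4) xyxy boxes horizontally; this kind of small array op runs
    # once per augmented image, so JIT compilation trims Python-level overhead.
    out = boxes.copy()
    out[:, 0] = img_width - boxes[:, 2]
    out[:, 2] = img_width - boxes[:, 0]
    return out

For example, flip_boxes_lr(np.array([[10.0, 20.0, 50.0, 60.0]]), 640.0) returns [[590.0, 20.0, 630.0, 60.0]].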

Tested with the following code:

from ultralytics.models.yolo.detect import DetectionTrainer
import time

start_time = time.time()

modelName = 'yolov8n.yaml'
overrides = {'epochs': 6, 'imgsz': 640, 'data': 'coco.yaml', 'model': modelName, 'batch': 8, 'close_mosaic': 3}
trainer = DetectionTrainer(overrides=overrides)
trainer.train()

end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")

[screenshot: training timing output]
Overall, with my changes it took 9754 seconds to train for 6 epochs (3 with mosaic, 3 without), versus 10531 seconds without my changes, roughly a 7.4% reduction.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Optimizations and improvements in YOLOv8 model processing and data augmentation techniques.

📊 Key Changes

  • Modified bounding box scaling to be more flexible and specific.
  • Implemented enhancements in data augmentation through numba for faster execution.
  • Streamlined and optimized various operations (flipping, translation, etc.) using numba.
  • Adjustments in dataset handling for more efficient processing and augmentation application.

🎯 Purpose & Impact

  • Enhanced Performance: The use of numba for just-in-time compilation significantly speeds up data processing, especially in augmentation tasks, leading to reduced training times.
  • Improved Accuracy: By refining how bounding boxes and images are scaled and transformed, the model can potentially achieve better training accuracy.
  • Flexible Data Handling: Changes in the way datasets and images are augmented allow for more complex and varied transformations, which can help the model generalize better over diverse data sets.

@glenn-jocher added the TODO label May 8, 2024
@glenn-jocher (Member)

@Laughing-q interesting training speedup PR. I think we'd like to merge this without the numba addition, as it may be hardware-specific, and we'd strongly prefer to avoid adding additional dependencies.

codecov bot commented May 9, 2024

Codecov Report

Attention: Patch coverage is 78.43137%, with 11 lines in your changes missing coverage. Please review.

Project coverage is 70.41%. Comparing base (51c3169) to head (3c64c6f).

File                             Patch %   Missing lines
ultralytics/utils/instance.py    63.63%    8 ⚠️
ultralytics/data/augment.py      94.73%    1 ⚠️
ultralytics/utils/loss.py         0.00%    1 ⚠️
ultralytics/utils/ops.py         87.50%    1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11718      +/-   ##
==========================================
- Coverage   74.59%   70.41%   -4.19%     
==========================================
  Files         124      124              
  Lines       15664    15693      +29     
==========================================
- Hits        11685    11050     -635     
- Misses       3979     4643     +664     
Flag         Coverage Δ
Benchmarks   35.43% <39.21%> (-0.08%) ⬇️
GPU          37.25% <31.37%> (-5.69%) ⬇️
Tests        66.52% <78.43%> (-3.77%) ⬇️

Flags with carried forward coverage won't be shown.


@edkazcarlson (Author)

[screenshot: training timing output]
Slightly slower than before after removing numba, but still faster than the current state of main.

@Burhan-Q added the enhancement label May 10, 2024
@glenn-jocher (Member)

Thanks for the update! It's great to hear that the performance improved even without numba. If you can share the specific metrics or any additional insights from your latest tests, that would be helpful for finalizing the merge. Let's aim for the best balance of dependency minimization and performance enhancement. 🚀

@edkazcarlson (Author)

> Thanks for the update! It's great to hear that the performance improved even without numba. If you can share the specific metrics or any additional insights from your latest tests, that would be helpful for finalizing the merge. Let's aim for the best balance of dependency minimization and performance enhancement. 🚀

In short, my main changes currently are:

  1. Taking better advantage of vectorized methods in order to boost performance.
  2. Not applying transforms that are passed a 0% probability of running, which saves roughly (# of transforms × # of images × # of epochs) calls to random.random(); see the sketch below.
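As a rough sketch of change 2 (the class and function names here are illustrative, not the exact PR diff): a Compose-style pipeline can filter out zero-probability transforms once at construction time, so the per-image probability roll never happens for them.

import random

class RandomFlip:
    # Illustrative transform with an application probability p.
    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, labels):
        if random.random() < self.p:
            labels["img"] = labels["img"][:, ::-1]  # horizontal flip (HWC image)
        return labels

def compose(transforms):
    # Drop p=0 transforms once, up front; they can never fire, so this saves
    # one random.random() call per skipped transform per image per epoch.
    active = [t for t in transforms if getattr(t, "p", 1.0) > 0]

    def pipeline(labels):
        for t in active:
            labels = t(labels)
        return labels

    return pipeline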

I don't have any specific metrics outside of wall-clock time for the epochs. Is there something specific the team would want?
Thanks :)

@glenn-jocher (Member)

Thanks for detailing the changes! The approach sounds solid, particularly your method to skip transformations when their probability is zero—definitely a smart optimization. 😊

For metrics, if you could provide us with a comparison in wall-clock time between the main branch and your changes (i.e., how much total time each epoch takes on average) across a few runs, that would be ideal. This will help us quantitatively assess the impact of your improvements.

Keep up the fantastic work! Looking forward to integrating these enhancements. 🌟

@Laughing-q (Member) commented May 13, 2024

@edkazcarlson Hi, thanks for the PR!
I tested your changes locally, but the results I got are almost the same as the main branch. Here are my results:
on main branch:
[screenshot: training timing on main]
on current PR:
[screenshot: training timing on this PR]
And my testing command:

yolo train detect data=coco.yaml model=yolov8m.yaml batch=64 epochs=4 close_mosaic=2 device=0,1,2,3

@edkazcarlson (Author) commented May 15, 2024

@Laughing-q Thank you for the tests on your hardware. Could you try running this through Python itself? I'm not sure where exactly the yolo command is pointing; could it be pointing at the install you have through pip and not my branch? (Can you confirm by using which? See the example below.)
I haven't had much of a chance between work and other things, so I haven't done in-depth tests yet, but from an initial comparison of my old tests against some new perf tests it seems like the new main is slower: checking out an old commit (e.g. 1365fe9) seems to be faster than the branch that's merged with main.
I plan to keep investigating this slowdown in the merged commits, but should we merge this into main for the time being just so this branch doesn't get stale? I don't see this PR hurting perf in any way (I can change the title just for better record keeping).
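(As a hedged example, not a command from the thread: printing the resolved package path shows whether the yolo CLI is backed by the pip install or the local checkout.)

import ultralytics
# A site-packages path points at the pip install; a path inside the cloned
# repo points at the local branch under test.
print(ultralytics.__file__)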

@glenn-jocher removed the TODO label May 15, 2024
@edkazcarlson changed the title from "Performance improvement to improve training time by up to ~14%" to "Performance improvement to improve training time" May 15, 2024
@edkazcarlson (Author)

Also, a side question: I know numba was declined due to hardware-compatibility concerns, but does the team accept Cython improvements?

@glenn-jocher (Member)

Hi there! Thanks for your continued contributions and for checking in about Cython. Yes, we're open to considering Cython improvements as they can be a great way to enhance performance while maintaining compatibility across various hardware setups. If you have specific optimizations in mind using Cython, feel free to share them or open a PR. We'd love to take a look! 🚀

@edkazcarlson (Author)

Thank you for your patience. After a number of different tests, I realized that a huge portion of my performance gains were likely caused by the (slightly) better cooling that my laptop's stand provides: I was getting 6-8% better performance with the laptop stand than without. Slightly embarrassing, but I'll try to keep this in mind for any future performance work I do. Apologies for the initial confusion and overblown estimate.

Comparing with and without my changes, though, I am still getting ~5% faster training with the laptop stand and ~3.5% faster without it.

I used the following code (6 epochs, 3 with mosaic, 3 without) to test.
Without the laptop stand, my times with my changes across multiple sessions were 10131, 9682, and 10084 seconds, for an average of 9965 seconds, while the main branch got 10390 and 10263 seconds, for an average of 10326 seconds. With my changes, training therefore took (9965 / 10326) ≈ 96.5% of the time it took on main.

For completeness, with the stand my changes took 9223 s and 9311 s (avg 9267 s) vs main's 9697 s and 9804 s (avg 9750 s), meaning with my changes training took about 95.0% as long.

from ultralytics.models.yolo.detect import DetectionTrainer
import time

start_time = time.time()

modelName = 'yolov8n.yaml'
overrides = {'epochs': 6, 'imgsz': 640, 'data': 'coco.yaml', 'model': modelName, 'batch': 8, 'close_mosaic': 3}
trainer = DetectionTrainer(overrides=overrides)
trainer.train()

end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")

@glenn-jocher (Member)

Thank you for the detailed update and for your honesty about the cooling factor—it's an interesting observation that highlights how various environmental factors can impact performance testing. 🌡️

Your continued efforts and the results you've shared are valuable. A consistent 3.5% to 5% improvement in training time is still quite significant, especially when scaled across multiple training sessions and models. It's clear that your changes are having a positive impact, even if the initial estimates were influenced by external factors.

Let's proceed with integrating your changes into the main branch. This will allow us to benefit from these improvements and also keep the project moving forward efficiently. Great work, and looking forward to more of your contributions! 🚀
