
Performance improvement to improve training time #11718

Open · wants to merge 54 commits into main

Conversation

@edkazcarlson commented May 7, 2024

In this PR I clean up some existing code to help improve training times.
The main speedup comes from not running transforms that are configured with a 0% chance of applying, but I also introduced JIT-compiled methods with the numba library to handle some operations that are run frequently.
If the team doesn't want to use numba (code clutter, licensing, etc.) I'll remove it; while the JIT-compiled methods did introduce some speedup, the majority came from the changes to the image transformations.
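As an illustration, here is a minimal sketch of the kind of numba JIT-compiled helper this refers to; the function name, box format, and shapes are assumptions for the example, not the exact code in this PR.

import numpy as np
from numba import njit

@njit(cache=True)
def flip_boxes_lr(boxes, img_width):
    # Mirror (N, 4) xyxy boxes horizontally; this kind of small array op runs
    # once per augmented image, so JIT compilation trims Python-level overhead.
    out = boxes.copy()
    out[:, 0] = img_width - boxes[:, 2]
    out[:, 2] = img_width - boxes[:, 0]
    return out

For example, flip_boxes_lr(np.array([[10.0, 20.0, 50.0, 60.0]]), 640.0) returns [[590.0, 20.0, 630.0, 60.0]].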

Tested with the following code:

from ultralytics.models.yolo.detect import DetectionTrainer
import time

start_time = time.time()

modelName = 'yolov8n.yaml'
overrides = {'epochs': 6, 'imgsz': 640, 'data': 'coco.yaml', 'model': modelName, 'batch': 8, 'close_mosaic': 3}
trainer = DetectionTrainer(overrides=overrides)
trainer.train()

end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")

[screenshot: training timing output]
Overall, with my changes it took 9754 seconds to train for 6 epochs (3 with mosaic, 3 without), versus 10531 seconds without my changes, roughly a 7.4% reduction.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Optimizations and improvements in YOLOv8 model processing and data augmentation techniques.

📊 Key Changes

  • Modified bounding box scaling to be more flexible and specific.
  • Implemented enhancements in data augmentation through numba for faster execution.
  • Streamlined and optimized various operations (flipping, translation, etc.) using numba.
  • Adjustments in dataset handling for more efficient processing and augmentation application.

🎯 Purpose & Impact

  • Enhanced Performance: The use of numba for just-in-time compilation significantly speeds up data processing, especially in augmentation tasks, leading to reduced training times.
  • Improved Accuracy: By refining how bounding boxes and images are scaled and transformed, the model can potentially achieve better training accuracy.
  • Flexible Data Handling: Changes in the way datasets and images are augmented allow for more complex and varied transformations, which can help the model generalize better over diverse data sets.

@glenn-jocher added the TODO label May 8, 2024
@glenn-jocher (Member)

@Laughing-q interesting training speedup PR. I think we'd like to merge this without the numba addition, as it may be hardware-specific, and we'd strongly prefer to avoid adding additional dependencies.

codecov bot commented May 9, 2024

Codecov Report

Attention: Patch coverage is 78.43137%, with 11 lines in your changes missing coverage. Please review.

Project coverage is 70.41%. Comparing base (51c3169) to head (3c64c6f).

File                             Patch %   Missing lines
ultralytics/utils/instance.py    63.63%    8 ⚠️
ultralytics/data/augment.py      94.73%    1 ⚠️
ultralytics/utils/loss.py         0.00%    1 ⚠️
ultralytics/utils/ops.py         87.50%    1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11718      +/-   ##
==========================================
- Coverage   74.59%   70.41%   -4.19%     
==========================================
  Files         124      124              
  Lines       15664    15693      +29     
==========================================
- Hits        11685    11050     -635     
- Misses       3979     4643     +664     
Flag         Coverage Δ
Benchmarks   35.43% <39.21%> (-0.08%) ⬇️
GPU          37.25% <31.37%> (-5.69%) ⬇️
Tests        66.52% <78.43%> (-3.77%) ⬇️

Flags with carried forward coverage won't be shown.


@edkazcarlson (Author)

[screenshot: training timing output]
Slightly slower than before after removing numba, but still faster than the current state of main.

@Burhan-Q added the enhancement label May 10, 2024
@glenn-jocher (Member)

Thanks for the update! It's great to hear that the performance improved even without numba. If you can share the specific metrics or any additional insights from your latest tests, that would be helpful for finalizing the merge. Let's aim for the best balance of dependency minimization and performance enhancement. 🚀

@edkazcarlson (Author)

> Thanks for the update! It's great to hear that the performance improved even without numba. If you can share the specific metrics or any additional insights from your latest tests, that would be helpful for finalizing the merge. Let's aim for the best balance of dependency minimization and performance enhancement. 🚀

In short, my main changes currently are:

  1. Taking better advantage of vectorized methods in order to boost performance.
  2. Not applying transforms that are passed a 0% probability of running, which saves roughly (# of transforms × # of images × # of epochs) calls to random.random(); see the sketch below.
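As a rough sketch of change 2 (the class and function names here are illustrative, not the exact PR diff): a Compose-style pipeline can filter out zero-probability transforms once at construction time, so the per-image probability roll never happens for them.

import random

class RandomFlip:
    # Illustrative transform with an application probability p.
    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, labels):
        if random.random() < self.p:
            labels["img"] = labels["img"][:, ::-1]  # horizontal flip (HWC image)
        return labels

def compose(transforms):
    # Drop p=0 transforms once, up front; they can never fire, so this saves
    # one random.random() call per skipped transform per image per epoch.
    active = [t for t in transforms if getattr(t, "p", 1.0) > 0]

    def pipeline(labels):
        for t in active:
            labels = t(labels)
        return labels

    return pipeline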

I don't have any specific metrics outside of wall-clock time for the epochs. Is there something specific the team would want?
Thanks :)

@glenn-jocher (Member)

Thanks for detailing the changes! The approach sounds solid, particularly your method to skip transformations when their probability is zero—definitely a smart optimization. 😊

For metrics, if you could provide us with a comparison in wall-clock time between the main branch and your changes (i.e., how much total time each epoch takes on average) across a few runs, that would be ideal. This will help us quantitatively assess the impact of your improvements.

Keep up the fantastic work! Looking forward to integrating these enhancements. 🌟

@Laughing-q (Member) commented May 13, 2024

@edkazcarlson Hi, thanks for the PR!
I tested your changes locally, but the results I got are almost the same as the main branch. Here are my results:
on main branch:
[screenshot: training timing on main]
on current PR:
[screenshot: training timing on this PR]
And my testing command:

yolo train detect data=coco.yaml model=yolov8m.yaml batch=64 epochs=4 close_mosaic=2 device=0,1,2,3

@edkazcarlson (Author) commented May 15, 2024

@Laughing-q Thank you for the tests on your hardware. Could you try running this through Python itself? I'm not sure where exactly the yolo command is pointing; could it be pointing at the install you have through pip and not my branch? (Can you confirm by using which? See the example below.)
I haven't had much of a chance between work and other things, so I haven't done in-depth tests yet, but from an initial comparison of my old tests against some new perf tests it seems like the new main is slower: checking out an old commit (e.g. 1365fe9) seems to be faster than the branch that's merged with main.
I plan to keep investigating this slowdown in the merged commits, but should we merge this into main for the time being just so this branch doesn't get stale? I don't see this PR hurting perf in any way (I can change the title just for better record keeping).
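(As a hedged example, not a command from the thread: printing the resolved package path shows whether the yolo CLI is backed by the pip install or the local checkout.)

import ultralytics
# A site-packages path points at the pip install; a path inside the cloned
# repo points at the local branch under test.
print(ultralytics.__file__)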

@glenn-jocher removed the TODO label May 15, 2024
@edkazcarlson changed the title from "Performance improvement to improve training time by up to ~14%" to "Performance improvement to improve training time" May 15, 2024
@edkazcarlson (Author)

Also, a side question: I know numba was declined due to hardware-compatibility concerns, but does the team accept Cython improvements?

@glenn-jocher (Member)

Hi there! Thanks for your continued contributions and for checking in about Cython. Yes, we're open to considering Cython improvements as they can be a great way to enhance performance while maintaining compatibility across various hardware setups. If you have specific optimizations in mind using Cython, feel free to share them or open a PR. We'd love to take a look! 🚀

@edkazcarlson (Author)

Thank you for your patience. After a number of different tests, I realized that a huge portion of my performance gains were likely caused by the (slightly) better cooling that my laptop's stand provides: I was getting 6-8% better performance with the laptop stand than without. Slightly embarrassing, but I'll try to keep this in mind for any future performance work I do. Apologies for the initial confusion and overblown estimate.

Comparing with and without my changes, though, I am still getting ~5% faster training with the laptop stand and ~3.5% faster without it.

I used the following code (6 epochs, 3 with mosaic, 3 without) to test.
Without the laptop stand, my times with my changes across multiple sessions were 10131, 9682, and 10084 seconds, for an average of 9965 seconds, while the main branch got 10390 and 10263 seconds, for an average of 10326 seconds. With my changes, training therefore took (9965 / 10326) ≈ 96.5% of the time it took on main.

For completeness, with the stand my changes took 9223 s and 9311 s (avg 9267 s) vs main's 9697 s and 9804 s (avg 9750 s), meaning with my changes training took about 95.0% as long.

from ultralytics.models.yolo.detect import DetectionTrainer
import time

start_time = time.time()

modelName = 'yolov8n.yaml'
overrides = {'epochs': 6, 'imgsz': 640, 'data': 'coco.yaml', 'model': modelName, 'batch': 8, 'close_mosaic': 3}
trainer = DetectionTrainer(overrides=overrides)
trainer.train()

end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")

@glenn-jocher (Member)

Thank you for the detailed update and for your honesty about the cooling factor—it's an interesting observation that highlights how various environmental factors can impact performance testing. 🌡️

Your continued efforts and the results you've shared are valuable. A consistent 3.5% to 5% improvement in training time is still quite significant, especially when scaled across multiple training sessions and models. It's clear that your changes are having a positive impact, even if the initial estimates were influenced by external factors.

Let's proceed with integrating your changes into the main branch. This will allow us to benefit from these improvements and also keep the project moving forward efficiently. Great work, and looking forward to more of your contributions! 🚀
