Overtrain fix #431

Merged
merged 4 commits into from
May 19, 2024

rappc87 (Contributor) commented May 15, 2024

Overtraining detection now works as it should. If lowest_value has not decreased by the time training reaches the epoch of lowest_value plus overtrain_threshold, the model is considered overtrained and training stops.

rappc87 (Contributor, Author) commented May 15, 2024

Added saving of a best_epoch file every time lowest_value changes.

aitronssesin (Member) commented

I think this doesn't work. I started training with an overtraining threshold of 10, and it started saving a lot of models seemingly at random, but then it removed all the saved models, and the training was very slow.

C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=1 | step=40 | time=12:46:47 | training_speed=0:00:27
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_1e_40s_best_epoch.pth' (epoch 1 and step 40)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=2 | step=80 | time=12:47:14 | training_speed=0:00:22 | lowest_value=27.87265396118164 (epoch 2 and step 69) | Number of epochs remaining for overtraining: 10
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_2e_80s_best_epoch.pth' (epoch 2 and step 80)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=3 | step=120 | time=12:47:41 | training_speed=0:00:21 | lowest_value=25.2319278717041 (epoch 3 and step 116) | Number of epochs remaining for overtraining: 10
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_3e_120s_best_epoch.pth' (epoch 3 and step 120)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=4 | step=160 | time=12:48:08 | training_speed=0:00:21 | lowest_value=18.749269485473633 (epoch 4 and step 150) | Number of epochs remaining for overtraining: 10
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_4e_160s_best_epoch.pth' (epoch 4 and step 160)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=5 | step=200 | time=12:48:33 | training_speed=0:00:21 | lowest_value=16.182971954345703 (epoch 5 and step 175) | Number of epochs remaining for overtraining: 10
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_5e_200s_best_epoch.pth' (epoch 5 and step 200)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=6 | step=240 | time=12:48:57 | training_speed=0:00:20 | lowest_value=16.182971954345703 (epoch 5 and step 175) | Number of epochs remaining for overtraining: 9
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=7 | step=280 | time=12:49:17 | training_speed=0:00:20 | lowest_value=13.52884292602539 (epoch 7 and step 268) | Number of epochs remaining for overtraining: 10
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_7e_280s_best_epoch.pth' (epoch 7 and step 280)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=8 | step=320 | time=12:49:42 | training_speed=0:00:20 | lowest_value=13.52884292602539 (epoch 7 and step 268) | Number of epochs remaining for overtraining: 9
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=9 | step=360 | time=12:50:03 | training_speed=0:00:21 | lowest_value=13.52884292602539 (epoch 7 and step 268) | Number of epochs remaining for overtraining: 8
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1\G_400.pth' (epoch 10)
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1\D_400.pth' (epoch 10)
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_10e_400s.pth' (epoch 10 and step 400)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=10 | step=400 | time=12:50:42 | training_speed=0:00:38 | lowest_value=12.121013641357422 (epoch 10 and step 377) | Number of epochs remaining for overtraining: 10
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_10e_400s_best_epoch.pth' (epoch 10 and step 400)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=11 | step=440 | time=12:51:07 | training_speed=0:00:18 | lowest_value=11.236621856689453 (epoch 11 and step 400) | Number of epochs remaining for overtraining: 10
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_11e_440s_best_epoch.pth' (epoch 11 and step 440)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=12 | step=480 | time=12:51:31 | training_speed=0:00:20 | lowest_value=10.71220874786377 (epoch 12 and step 454) | Number of epochs remaining for overtraining: 10
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_12e_480s_best_epoch.pth' (epoch 12 and step 480)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=13 | step=520 | time=12:51:56 | training_speed=0:00:20 | lowest_value=10.71220874786377 (epoch 12 and step 454) | Number of epochs remaining for overtraining: 9
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=14 | step=560 | time=12:52:16 | training_speed=0:00:20 | lowest_value=9.570732116699219 (epoch 14 and step 536) | Number of epochs remaining for overtraining: 10
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_14e_560s_best_epoch.pth' (epoch 14 and step 560)
C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1 | epoch=15 | step=600 | time=12:52:42 | training_speed=0:00:21 | lowest_value=8.836076736450195 (epoch 15 and step 560) | Number of epochs remaining for overtraining: 10
Saved model 'C:\Users\Aitor\Downloads\IA_Applio\Applio\logs\Prueba1_15e_600s_best_epoch.pth' (epoch 15 and step 600)

Saved files (the first epoch is the sync graph): [image: screenshot of the saved files]

rappc87 (Contributor, Author) commented May 19, 2024

You said this fix doesn't work, so I would like to summarize how it works.

From the screenshot I see that you have set it to save every 10 epochs.

If you set the overtraining threshold to 10 epochs, this means: keep training for another 10 epochs after the last recorded best_epoch.pth file and, if there is no improvement in that time, finish training because the model is overtraining.

Every time a new best_epoch is found, it deletes the old best_epoch file, because the previous best_epoch.pth file is no longer the best epoch.
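The replace-on-improvement behavior described here can be sketched roughly as follows. This is only an illustration, not the code from this PR: `rotate_best_checkpoint` is a hypothetical helper, and an empty file stands in for the real model save. Only the file-name pattern is taken from the log.

```python
import tempfile
from pathlib import Path
from typing import Optional


def rotate_best_checkpoint(log_dir: Path, name: str, epoch: int, step: int,
                           previous: Optional[Path]) -> Path:
    """Write the new best-epoch file, then delete the one it supersedes.

    Hypothetical helper: the real trainer would serialize model weights here.
    """
    new_path = log_dir / f"{name}_{epoch}e_{step}s_best_epoch.pth"
    new_path.write_bytes(b"")  # placeholder for the actual model save
    if previous is not None and previous != new_path and previous.exists():
        previous.unlink()  # the old file no longer describes the best epoch
    return new_path


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    best: Optional[Path] = None
    # Three successive improvements, using epoch/step pairs seen in the log.
    for epoch, step in [(5, 200), (7, 280), (10, 400)]:
        best = rotate_best_checkpoint(log_dir, "Prueba1", epoch, step, best)
    remaining = sorted(p.name for p in log_dir.glob("*_best_epoch.pth"))
    print(remaining)
```

Under these assumptions only one best_epoch file exists at any time, which would explain why earlier best_epoch files disappear as training keeps improving.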

So, to summarize: if current_epoch > best_epoch + overtraining_threshold_value, training stops because of overtraining. Every time a new best epoch is found, best_epoch.pth is saved and the previous best epoch file is deleted.
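As a rough sketch of that stopping rule (not the PR's actual implementation), the check could look like this. The function names `epochs_remaining` and `find_stop_epoch` are made up for illustration, and the sketch assumes one loss value per epoch, whereas the log tracks lowest_value per step.

```python
from typing import Iterable, Optional


def epochs_remaining(current_epoch: int, best_epoch: int, threshold: int) -> int:
    """Epochs left before the overtraining rule would trigger (assumed formula)."""
    return best_epoch + threshold - current_epoch


def find_stop_epoch(losses: Iterable[float], threshold: int) -> Optional[int]:
    """Return the first epoch where current_epoch > best_epoch + threshold.

    Returns None if the rule never triggers for the given losses.
    """
    lowest_value = float("inf")
    best_epoch = 0
    for current_epoch, loss in enumerate(losses, start=1):
        if loss < lowest_value:
            # New best epoch: this is where best_epoch.pth would be rewritten.
            lowest_value = loss
            best_epoch = current_epoch
        elif current_epoch > best_epoch + threshold:
            return current_epoch
    return None


# Made-up losses: the best value is reached at epoch 3 and never beaten.
example_losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.3]
print(find_stop_epoch(example_losses, threshold=3))   # stops at epoch 7
print(find_stop_epoch(example_losses, threshold=10))  # None: never triggers
print(epochs_remaining(current_epoch=6, best_epoch=5, threshold=10))  # 9
```

The last call is consistent with the counter in the log, where epoch 6 reports 9 epochs remaining after a best epoch of 5 with a threshold of 10, although the real counter may be computed differently.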

aitronssesin (Member) commented

We're gonna merge this pull request and give it a spin. If the overtraining detector looks sharper, we'll roll with the changes.

aitronssesin reopened this May 19, 2024
aitronssesin merged commit 8ce5723 into IAHispano:main May 19, 2024
0 of 2 checks passed