Colorization training isn't working #37

Open
omerb01 opened this issue Sep 1, 2022 · 24 comments

Comments


omerb01 commented Sep 1, 2022

I downloaded the Flickr25k dataset, preprocessed it, and trained a model with these modifications in the config file:

  • batch size of 256 per GPU across 4 GPUs (so a total batch size of 1024)
  • image resolution 64x64

The rest of the configurations remained as in the current config file.
Even after 1000 training epochs, the model still produces bad results.

Is there anything I'm missing? Thanks.
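
For reference, the setup described above amounts to roughly the following (a sketch only; the key names are illustrative and may not match the actual JSON config in this repo):

```python
# Rough sketch of the setup described above -- key names are illustrative,
# not necessarily the ones used in this repository's JSON config.
per_gpu_batch_size = 256
num_gpus = 4
effective_batch_size = per_gpu_batch_size * num_gpus  # 1024

config_overrides = {
    "datasets": {
        "train": {
            "dataloader": {"args": {"batch_size": per_gpu_batch_size}},
            "which_dataset": {"args": {"image_size": [64, 64]}},
        }
    }
}
```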


xenova commented Sep 2, 2022

I'm experiencing the same issues. Are your results also very unsaturated?


xenova commented Sep 2, 2022

I'm not sure if you have tried this, but what about setting "clip_denoised" to False (instead of True, which is the default)? It might produce more saturated results.

^ I will try this for my task and let you know how it goes
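
For context, `clip_denoised` in DDPM-style samplers usually controls whether the predicted clean image is clamped to the data range at every reverse step; a generic sketch of the idea (not this repository's exact code):

```python
def predict_x0(x_t, t, predicted_noise,
               sqrt_recip_alphas_cumprod, sqrt_recipm1_alphas_cumprod,
               clip_denoised=True):
    """Generic DDPM step: recover the model's estimate of the clean image x0.

    With clip_denoised=True the estimate is clamped to [-1, 1] before the next
    reverse step, which stabilizes sampling but can also suppress strongly
    saturated colors -- hence the suggestion to try clip_denoised=False.
    """
    x0_hat = (sqrt_recip_alphas_cumprod[t] * x_t
              - sqrt_recipm1_alphas_cumprod[t] * predicted_noise)
    if clip_denoised:
        x0_hat = x0_hat.clamp(-1.0, 1.0)
    return x0_hat
```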

Author

omerb01 commented Sep 2, 2022

Thanks @xenova, waiting for your update.


xenova commented Sep 2, 2022

After training for another 3 hours with clip_denoised=False, I haven't seen any improvement. Perhaps @Janspiry can provide some extra assistance.

@ksunho9508

@xenova @omerb01 Hello, did you solve the issue? I am still having problems with the colorization task.


xenova commented Sep 16, 2022

> @xenova @omerb01 Hello, did you solve the issue? I am still having problems with the colorization task.

Nope, still struggling with colorization

Author

omerb01 commented Sep 16, 2022

@ksunho9508 @xenova I am still unable to obtain reliable results. In my opinion, the Flickr dataset does not contain enough data to generalize this task with diffusion-based methods. The authors of the original paper applied their method to ImageNet, which contains much more training data.

@Janspiry
Owner

Hi guys, sorry about this problem.

Like @omerb01 said, I share the view that the Flickr dataset is too small for colorization of natural scenes.
Maybe you should do this task on ImageNet or Places2. More information can be found in #17


xenova commented Sep 16, 2022

@Janspiry I've also tried this on my custom dataset (with millions of images), and I get the same results :/ ... I'm really not sure why this is the only task facing these issues; all other tasks seem to work fine.

@Janspiry
Owner

@xenova I'll make sure there are no bugs in the colorization part of the code.


KSH0660 commented Sep 16, 2022

@Janspiry Thank you. Could you also add a config file for super-resolution?


edcson commented Nov 22, 2022

I also ran into this problem. I trained on my own small-scale dataset, but still failed to get good results after many epochs. @Janspiry

@kkamankun

@omerb01
Have you tried running experiments under the same conditions after changing GroupNorm to BatchNorm? It seems that using BatchNorm instead of GroupNorm lets colorization work to some extent, distinguishing between the background and objects.
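
If anyone wants to try that suggestion, here is a generic PyTorch sketch for swapping the normalization layers (illustrative only; adapt the traversal to the actual UNet used here):

```python
import torch.nn as nn

def replace_groupnorm_with_batchnorm(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.GroupNorm with an nn.BatchNorm2d over the
    same channel count. Note that BatchNorm statistics depend on the per-GPU
    batch size, so multi-GPU runs may want nn.SyncBatchNorm instead."""
    for name, child in module.named_children():
        if isinstance(child, nn.GroupNorm):
            setattr(module, name, nn.BatchNorm2d(child.num_channels))
        else:
            replace_groupnorm_with_batchnorm(child)
    return module
```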

@AlanZhang1995

I experienced the same problem.
BTW, have you guys checked the training log? According to mine, it seems the network suffers from severe overfitting:
'''
INFO: Begin model train.
INFO: train/mse_loss: 0.1167483588039875
INFO: train/mse_loss: 0.0724316855113022
INFO: train/mse_loss: 0.06527451830048543
INFO: epoch: 1
INFO: iters: 23488
INFO: train/mse_loss: 0.020401993506137254
INFO: train/mse_loss: 0.018878939009419112
INFO: train/mse_loss: 0.018366146380821978
INFO: epoch: 2
INFO: iters: 46976
INFO: train/mse_loss: 0.014938667484635498
INFO: train/mse_loss: 0.0148746125182753
INFO: train/mse_loss: 0.014505743447781326
INFO: train/mse_loss: 0.014465472793432741
INFO: epoch: 3
INFO: iters: 70464
INFO: train/mse_loss: 0.014389766222024227
INFO: train/mse_loss: 0.013453237237986066
INFO: train/mse_loss: 0.013306563555842919
INFO: epoch: 4
INFO: iters: 93952
INFO: train/mse_loss: 0.012647044245178611
INFO: train/mse_loss: 0.012807737045385967
INFO: train/mse_loss: 0.011968838741840434
INFO: epoch: 5
INFO: iters: 117440
INFO:

------------------------------Validation Start------------------------------
INFO: val/mae: 0.3139403760433197
INFO:
------------------------------Validation End------------------------------

INFO: train/mse_loss: 0.011829124199711352
INFO: epoch: 6
INFO: iters: 140938
INFO: train/mse_loss: 0.010201521161369924
INFO: epoch: 7
INFO: iters: 164426
INFO: train/mse_loss: 0.010018873226117376
INFO: epoch: 8
INFO: iters: 187914
INFO: train/mse_loss: 0.009995935927926308
INFO: epoch: 9
INFO: iters: 211402
INFO: train/mse_loss: 0.009544536813287326
INFO: epoch: 10
INFO: iters: 234890
INFO: Saving the self at the end of epoch 10
INFO:

------------------------------Validation Start------------------------------
INFO: val/mae: 0.43820616602897644
INFO:
------------------------------Validation End------------------------------
'''

@1228967342

> I experienced the same problem. BTW, have you guys checked the training log? According to mine, it seems the network suffers from severe overfitting: [training log quoted above]

No, it isn't overfitting. The diffusion model's loss is the mse_loss between the added noise and the predicted noise; see #26 (comment). Diffusion model inference is also highly stochastic, so this kind of result is quite normal.

23-09-03 04:09:21.974 - INFO: train/mse_loss: 0.004320403648868778
23-09-03 04:09:21.974 - INFO: epoch: 1423
23-09-03 04:09:21.974 - INFO: iters: 2072372
23-09-03 04:09:21.974 - INFO: Saving the self at the end of epoch 1423
23-09-03 04:09:23.265 - INFO:

------------------------------Validation Start------------------------------
23-09-03 04:20:12.848 - INFO: val/1-ssim: 0.1557578444480896
23-09-03 04:20:12.848 - INFO:
------------------------------Validation End------------------------------

23-09-03 04:23:16.320 - INFO: train/mse_loss: 0.004661682129078468
23-09-03 04:23:16.320 - INFO: epoch: 1424
23-09-03 04:23:16.320 - INFO: iters: 2073832
23-09-03 04:23:16.320 - INFO: Saving the self at the end of epoch 1424
23-09-03 04:23:17.622 - INFO:

------------------------------Validation Start------------------------------
23-09-03 04:34:06.690 - INFO: val/1-ssim: 0.10180902481079102
23-09-03 04:34:06.690 - INFO:
------------------------------Validation End------------------------------

23-09-03 04:37:05.177 - INFO: train/mse_loss: 0.004233014806692961
23-09-03 04:37:05.177 - INFO: epoch: 1425
23-09-03 04:37:05.177 - INFO: iters: 2075292
23-09-03 04:37:05.177 - INFO: Saving the self at the end of epoch 1425
23-09-03 04:37:06.475 - INFO:

------------------------------Validation Start------------------------------
23-09-03 04:47:56.020 - INFO: val/1-ssim: 0.1559600830078125
23-09-03 04:47:56.020 - INFO:
------------------------------Validation End------------------------------

23-09-03 04:50:55.078 - INFO: train/mse_loss: 0.004784488215476157
23-09-03 04:50:55.078 - INFO: epoch: 1426
23-09-03 04:50:55.078 - INFO: iters: 2076752
23-09-03 04:50:55.078 - INFO: Saving the self at the end of epoch 1426
23-09-03 04:50:56.547 - INFO:

------------------------------Validation Start------------------------------
23-09-03 05:01:45.988 - INFO: val/1-ssim: 0.06806707382202148
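
To make the explanation above concrete: the training objective in this kind of diffusion model is the MSE between the Gaussian noise added to the color image and the noise the network predicts. A generic epsilon-prediction sketch (not this repository's exact code; conditioning by channel-concatenation is an assumption):

```python
import torch
import torch.nn.functional as F

def diffusion_training_loss(model, y_0, x_cond, alphas_cumprod):
    """Generic epsilon-prediction objective: MSE between the noise that was
    added to the ground-truth color image y_0 and the noise the network
    predicts given the conditioning input x_cond (e.g. the grayscale image)."""
    b = y_0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=y_0.device)
    noise = torch.randn_like(y_0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    y_t = a_bar.sqrt() * y_0 + (1.0 - a_bar).sqrt() * noise  # forward process q(y_t | y_0)
    predicted_noise = model(torch.cat([x_cond, y_t], dim=1), t)
    return F.mse_loss(predicted_noise, noise)
```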

@yuanc3

yuanc3 commented Sep 22, 2023

Hi, my problem seems to be overfitting as well. My dataset has 10k images; I trained on two 3090s for 12 hours, and now the model only generates noise on the validation set. The val loss also stays around 0.7.

@TumVink

TumVink commented Sep 27, 2023

This happens for me as well. The training loss decreases very quickly, dropping to 0.02 after 5 epochs, but the validation results are terrible.
Does anyone have an idea?



TumVink commented Oct 10, 2023

> Hi, my problem seems to be overfitting as well. My dataset has 10k images; I trained on two 3090s for 12 hours, and now the model only generates noise on the validation set. The val loss also stays around 0.7.

I think it is actually quite normal that the val loss is much larger than the training loss, considering that the loss is computed differently during inference than during training.
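
Concretely, the number logged during training is a noise-prediction MSE at a single random timestep, while the validation scores in the logs above (val/mae, val/1-ssim) appear to be computed on the image produced by running the full reverse process, so the two live on different scales. A rough sketch under that assumption (`sample_fn` stands in for the repository's sampling routine and is hypothetical):

```python
import torch

@torch.no_grad()
def validation_mae(sample_fn, model, x_cond, y_0):
    """Validation as suggested by the logs above: run the full (stochastic)
    reverse diffusion, then compare the generated image with the ground truth.
    This image-space MAE is not comparable to the training-time noise MSE."""
    y_hat = sample_fn(model, x_cond)
    return torch.mean(torch.abs(y_hat - y_0))
```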


1228967342 commented Oct 21, 2023 via email


TumVink commented Oct 21, 2023

> My training results were also poor at first: with the default MSE loss, the colorized images were severely color-shifted even after 1200 epochs of training. After switching to a different loss function the results improved, though I never ran into the case where validation only produces noise (I attached some of the poor validation images). My training set has only 1500 images, yet after changing the loss function the results on the test set are acceptable, so 10k images shouldn't overfit that easily. You could try running the model on the training images; you may find it can't colorize well even on the training set. I'd suggest trying a different loss function; my current training results are decent.

Hey, glad to hear it! At least it confirms the correctness of this repo.
Would you mind sharing more details about the loss function? Is it an image-level loss, for example a structural-similarity loss?

BW,
Jingsong


TumVink commented Oct 23, 2023 via email

@1228967342

> The hybrid loss function you mentioned is a mixture of the true variational lower bound and BCE. Am I right?

No, it's just a very simple mixture.
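
The exact mixture was not shared in this thread, so purely as an illustration of what "a simple mix" of the noise MSE with an image-level term could look like (hypothetical, not the loss used above):

```python
import torch.nn.functional as F

def hybrid_loss(predicted_noise, noise, y0_hat, y_0, image_weight=0.5):
    """Hypothetical example only: the standard noise MSE plus an L1 term on
    the model's current estimate of the clean image y0_hat."""
    return F.mse_loss(predicted_noise, noise) + image_weight * F.l1_loss(y0_hat, y_0)
```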

@ludandandan

@1228967342 Hi, I've run into the same problem as well. Could you share your hybrid loss design?
