About multi-gpus training #27

Open
YANDaoyu opened this issue Mar 1, 2024 · 2 comments

Comments

YANDaoyu commented Mar 1, 2024

This is definitely impressive work!

I'm trying to reproduce some results on the inpainting task and have a concern about the data_parallel mode.
Looking at the code, the batch_size is 4 for a single GPU, and there are about 2.8M inpainting pairs in total, so the logged total is 700k steps.
When I train on 8 GPUs, the total is still logged as 700k steps, so I checked the GPU memory usage -- all GPUs are nearly fully utilized.
I'm wondering: is the training batch_size on 8 GPUs actually 4*8? Or is there some misalignment in the logging?

Thanks for your time.
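
For reference, the arithmetic behind the 700k figure, as a minimal sketch using the numbers quoted above (variable names are illustrative, not from the repo):

```python
# Step count implied by the question: ~2.8M inpainting pairs at a
# single-GPU batch_size of 4 gives 700k steps per pass over the data.
dataset_size = 2_800_000        # ~2.8M inpainting pairs
batch_size_per_gpu = 4          # single-GPU batch_size

steps = dataset_size // batch_size_per_gpu
print(steps)                    # 700000 -> the 700k logged steps
```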

@gallenszl

Same question. Looking forward to the answer. @canqin001 @shugerdou

@canqin001 (Contributor)

Thank you for this question. For multi-GPU training, the overall batch size is the per-GPU batch size multiplied by the number of GPUs. The 700k iteration count is independent of the batch size, so you need to manually adjust the number of iterations to match the overall computation cost.
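
In other words, a minimal sketch of the bookkeeping this implies, assuming plain data-parallel training where each GPU processes its own batch per optimizer step (names are illustrative, not from the repo):

```python
# Effective batch size under 8-GPU data parallelism, per the reply above.
batch_size_per_gpu = 4
num_gpus = 8
effective_batch_size = batch_size_per_gpu * num_gpus   # 4 * 8 = 32

# The 700k iterations were derived from the single-GPU batch size, so to
# keep the total number of training samples seen the same, scale the
# iteration count down by the number of GPUs.
single_gpu_iterations = 700_000
adjusted_iterations = single_gpu_iterations // num_gpus
print(effective_batch_size, adjusted_iterations)       # 32 87500
```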
