[BUG?] Higher "gradient_accumulation_steps" still increases memory usage a lot #1123

Open
exnx opened this issue Jan 15, 2024 · 3 comments
Labels: bug (Something isn't working)

exnx (Contributor) commented Jan 15, 2024

Hello, I am seeing a large increase in GPU memory usage as I increase gradient_accumulation_steps. For example, I can fit a desired sequence length with gradient_accumulation_steps at 1 and 4, but at 8 I get an out-of-memory error.

I am using 64 GPUs (16 nodes). Here's the memory usage as I increase gradient_accumulation_steps:

grad accum -> memory
1 -> 54 GB
2 -> 60.6 GB
4 -> 70.4 GB
8 -> OOM

My understanding was that gradient_accumulation_steps is generally decoupled from memory usage, but that's not what I'm seeing. I.e., a higher gradient_accumulation_steps should just make each optimizer step take longer, not use significantly more memory.
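
For concreteness, this is the kind of accumulation loop I have in mind (a minimal plain-PyTorch sketch with no model or pipeline parallelism; model, batches, and optimizer are stand-ins rather than our actual training code). Gradients are added in place into the existing .grad buffers and each micro-batch's activations are freed after its backward pass, which is why I'd expect peak memory to be roughly independent of the number of accumulation steps:

```python
import torch.nn.functional as F

def train_step(model, batches, optimizer, grad_accum_steps):
    """One optimizer step spread over grad_accum_steps micro-batches."""
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(batches):
        if step >= grad_accum_steps:
            break
        logits = model(inputs)
        # Scale so the accumulated gradient matches the mean over micro-batches.
        loss = F.cross_entropy(logits, targets) / grad_accum_steps
        # backward() adds into param.grad in place; this micro-batch's
        # activations are freed afterwards, so peak memory should stay flat.
        loss.backward()
    optimizer.step()
```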

I am wondering if this also depends on the model-parallel and pipeline-parallel sizes. I am generally using 8 and 8, but I've tried other settings too. My best guess is that with this parallelism there is extra communication or buffering needed to move the accumulated gradients around. When I tested on a single node, memory stayed roughly flat as I increased gradient_accumulation_steps.
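
For reference, here is the rough arithmetic for my setup, assuming the usual 3D layout where world size = model parallel × pipe parallel × data parallel (and 4 GPUs per node, given 64 GPUs across 16 nodes; correct me if the degrees compose differently here):

```python
world_size = 64                    # 16 nodes x 4 GPUs
model_parallel = 8
pipe_parallel = 8

# Data-parallel degree is whatever is left over after model/pipe parallelism.
data_parallel = world_size // (model_parallel * pipe_parallel)
print(data_parallel)               # 1 in my case
```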

Is anyone else experiencing this, or does anyone know whether this behavior is expected?

Thanks!

exnx added the bug label Jan 15, 2024
StellaAthena (Member) commented Jan 15, 2024

Are you holding the microbatch size fixed? Or are you decreasing it as you increase gradient accumulation?

exnx (Contributor, Author) commented Jan 15, 2024

Hi @StellaAthena!

I'm increasing the total batch size by the gradient accumulation factor only; the micro batch size is always just 1 in my case.
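
Concretely, the relationship I'm relying on (this is my own accounting of how the global batch is composed, not something I've verified against the library's internals):

```python
micro_batch_per_gpu = 1
data_parallel = 1                  # 64 GPUs / (8 model-parallel x 8 pipe-parallel)

def global_batch(grad_accum_steps):
    # Assuming global batch = micro batch * grad accum steps * data-parallel size.
    return micro_batch_per_gpu * grad_accum_steps * data_parallel

print([global_batch(g) for g in (1, 2, 4, 8)])   # [1, 2, 4, 8]
```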

Quentin-Anthony self-assigned this Jan 17, 2024
exnx (Contributor, Author) commented Jan 18, 2024

It would be great to hear whether anyone else has experienced this in general too, or if I'm a crazy person. Thanks!
