Not receiving grads with cpu_besides? #4

Open

conceptofmind opened this issue Dec 14, 2023 · 26 comments

@conceptofmind

Hi Phil,

I have been working with @tomaarsen of HF and @haileyschoelkopf of EAI on testing soft MoE.

One issue that came up was that the tensors were not contiguous:

    gathered_tensors = all_gather_same_dim(t)
    return AllGatherFunction.apply(x, self.dim, sizes)
    x, batch_sizes = all_gather_variable_dim(x, dim = dim, sizes = sizes)
    gathered_tensors = all_gather_same_dim(t)
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be contiguous

Adding .contiguous() to t in all_gather_same_dim() seemed to resolve this issue:

def all_gather_same_dim(t):
    world_size = dist.get_world_size()
    gathered_tensors = [torch.empty_like(t.contiguous(), device = t.device, dtype = t.dtype) for i in range(world_size)]
    dist.all_gather(gathered_tensors, t.contiguous())
    return gathered_tensors
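
For anyone else hitting this, here is a minimal standalone sketch (a single-process gloo group, not our actual training setup) of why the .contiguous() calls matter: dist.all_gather rejects non-contiguous inputs such as a transposed view.

import os
import torch
import torch.distributed as dist

def all_gather_same_dim(t):
    world_size = dist.get_world_size()
    # .contiguous() avoids "RuntimeError: Tensors must be contiguous" for non-contiguous inputs
    gathered_tensors = [torch.empty_like(t.contiguous()) for _ in range(world_size)]
    dist.all_gather(gathered_tensors, t.contiguous())
    return gathered_tensors

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank = 0, world_size = 1)
    x = torch.randn(4, 8).t()  # .t() returns a non-contiguous view
    print([g.shape for g in all_gather_same_dim(x)])
    dist.destroy_process_group()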

But afterwards another issue presented itself: some parameter indices were not receiving grads during the backward pass:

Parameter indices which did not receive grad for rank 1: 8 9 10 11 12 13 14 15 30 31 32 33 34 35 36 37 52 53 54 55 56 57 58 59 74 75 76 77 78 79 80 81

Parameter indices which did not receive grad for rank 0: 16 17 18 19 20 21 22 23 38 39 40 41 42 43 44 45 60 61 62 63 64 65 66 67 82 83 84 85 86 87 88 89

[screenshot: ranks_not_correct]

We traced this back to self.all_experts_to_cpu_besides(expert_slice):

        # get the experts in use

        self.all_experts_to_cpu_besides(expert_slice)

        experts = self.experts[expert_slice]

After commenting out self.all_experts_to_cpu_besides(expert_slice), the script ran and the loss decreased seemingly normally with AMP.
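
For context, here is a rough sketch of what we assume the offload helper is doing (reconstructed from the call site above, not the actual soft-moe-pytorch source; the class and attribute names are ours). If experts outside the active slice live on CPU, their parameters never produce grads on a given rank, which would match the missing-grad indices DDP reports:

import torch
from torch import nn

class ExpertsSketch(nn.Module):
    # hypothetical stand-in for the experts container, only to illustrate the offload behavior
    def __init__(self, num_experts = 8, dim = 16):
        super().__init__()
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def all_experts_to_cpu_besides(self, expert_slice):
        active = set(range(*expert_slice.indices(len(self.experts))))
        for idx, expert in enumerate(self.experts):
            # experts outside the active slice are offloaded, so this rank
            # never computes grads for their parameters
            expert.to(self.device if idx in active else 'cpu')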

Do you have any idea why the above issue would occur or how it should properly be resolved?

Always greatly appreciate your help.

Thank you,

Enrico

@lucidrains
Owner

@conceptofmind thanks for pointing out the need for a contiguous

for the latter issue, i don't really know off the top of my head, not without spending some time debugging

is this on one machine? maybe the logic for determining which expert is activated can be moved to init and the device can be set before forward? would welcome a PR if you figure it out, now that mixture of experts is becoming a hot thing

@lucidrains
Owner

lucidrains commented Dec 14, 2023

@conceptofmind i thought you were working on startups with Aran? lol

that's what Aran told me last i chatted with him

regardless, it is cool you are all working on MoE! it needs more love in the open source space

@tomaarsen

tomaarsen commented Dec 14, 2023

@lucidrains Regarding 8c3fedb, t also needed to be made contiguous, e.g. note the second-to-last line:

def all_gather_same_dim(t):
    world_size = dist.get_world_size()
    gathered_tensors = [torch.empty_like(t.contiguous(), device = t.device, dtype = t.dtype) for i in range(world_size)]
    dist.all_gather(gathered_tensors, t.contiguous())
    return gathered_tensors

@lucidrains
Owner

@tomaarsen oops, fixed

@conceptofmind
Author

conceptofmind commented Dec 14, 2023

> @conceptofmind thanks for pointing out the need for a contiguous
>
> for the latter issue, i don't really know off the top of my head, not without spending some time debugging
>
> is this on one machine? maybe the logic for determining which expert is activated can be moved to init and the device can be set before forward? would welcome a PR if you figure it out, now that mixture of experts is becoming a hot thing

Hi Phil,

Thank you for the response.

We are happy to continue diagnosing the issue and open a PR. We can also provide some DDP training code later.

This is currently one machine with 8 GPUs, run using torchrun --nnodes=1 --nproc_per_node=8 script.py.

I should also clarify that find_unused_parameters=True was required in DDP, in addition to commenting out self.all_experts_to_cpu_besides(expert_slice):

self.model = DDP(
  self.model, 
  device_ids=[self.local_rank],
  output_device=self.local_rank,
  find_unused_parameters=True,
  gradient_as_bucket_view=True
)

If find_unused_parameters is not set to True and self.all_experts_to_cpu_besides(expert_slice) remains, this device-placement error occurs:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:6 and cpu!
    exp_avg.lerp_(grad, 1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:4 and CPU!

We checked whether the DDP model was on CUDA, and it looks like it is:

cuda:0
cuda:1
cuda:7
cuda:6
cuda:4
cuda:2
cuda:3
cuda:5

> @conceptofmind i thought you were working on startups with Aran? lol
>
> that's what Aran told me last i chatted with him
>
> regardless, it is cool you are all working on MoE! it needs more love in the open source space

Aran and I did co-found a startup together! We will hopefully have some interesting things related to OSS MoE out in the near future!

We have been trying to organize research efforts across OSS organizations such as EAI, HF, LAION, SAI, etc., as we feel this type of collaboration will lead to much more fruitful results.

@lucidrains
Owner

yup, makes sense re: ddp settings

a PR would be a great contribution! there's really only a handful of people who are working on MoE

ok cool, looking forward to seeing what you and Aran end up building!

@lucidrains
Owner

are you aware that soft moe will not work in LLMs? also have this built https://github.com/lucidrains/st-moe-pytorch

@tomaarsen

Yes, we are not looking to apply this for LLMs :)

@conceptofmind
Author

conceptofmind commented Dec 14, 2023

> are you aware that soft moe will not work in LLMs? also have this built https://github.com/lucidrains/st-moe-pytorch

Hopefully we can provide more information to satisfy that curiosity soon 😄

We can send an email or post an update here with some preliminary results.

We are aware that this type of MoE is incompatible with LLMs, but we think it can be applied to some other interesting use cases.

Going to test out st-moe as well!

@lucidrains
Owner

lucidrains commented Dec 14, 2023

@conceptofmind yes indeed, i've seen some papers using mixture of experts in text-to-image models with great success

nice, as long as you are aware!

@conceptofmind
Author

> @conceptofmind yes indeed, i've seen some papers using mixture of experts in text-to-image models with great success
>
> nice, as long as you are aware!

Hi Phil,

We definitely appreciate you bringing attention to this. I imagine it will save someone future confusion regarding soft MoE and LLMs if they read this issue!

RE: if you all ever get stuck, if you can get me ssh access to one machine with multiple GPUs, i can put in some hours into debugging too

Absolutely! If we do get stuck, I can definitely get you access to one or multiple machines through LAION/SAI or with one of our grants from Modal/Latitude. We are currently talking to Hugging Face about compute allocations as well.

Thank you,

Enrico

@conceptofmind
Author

Quick update.

Using gradient_as_bucket_view=True with self.all_experts_to_cpu_besides(expert_slice):

self.model = DDP(
    self.model, 
    device_ids=[self.local_rank],
    output_device=self.local_rank,
    find_unused_parameters=True,
    gradient_as_bucket_view=True
)

causes this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:6 and cpu!
    exp_avg.lerp_(grad, 1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:4 and CPU!

Until further evaluation, one or the other needs to be chosen. Setting gradient_as_bucket_view=False is a bit less performant in terms of training FLOPS. We are going to test whether gradient_as_bucket_view=True is fully compatible with MoE.

Is self.all_experts_to_cpu_besides(expert_slice) a requirement for training? Or just an inference-time improvement?

@lucidrains
Owner

@conceptofmind could you show the full stack trace? what would happen if you just cloned the adam / adamw code and replaced that line with exp_avg.lerp_(grad.to(exp_avg.device), 1 - beta1) ?

@conceptofmind
Author

conceptofmind commented Dec 17, 2023

> @conceptofmind could you show the full stack trace? what would happen if you just cloned the adam / adamw code and replaced that line with exp_avg.lerp_(grad.to(exp_avg.device), 1 - beta1) ?

Hi Phil,

Here is the stack trace:

    optimizer.step()
  File "lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    ret = func(self, *args, **kwargs)
  File "/lib/python3.10/site-packages/torch/optim/adamw.py", line 184, in step
    return wrapped(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    adamw(
  File "/lib/python3.10/site-packages/torch/optim/adamw.py", line 335, in adamw
    out = func(*args, **kwargs)
  File "lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    func(
  File "lib/python3.10/site-packages/torch/optim/adamw.py", line 412, in _single_tensor_adamw
    ret = func(self, *args, **kwargs)
  File "lib/python3.10/site-packages/torch/optim/adamw.py", line 184, in step
    exp_avg.lerp_(grad, 1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cpu!
    adamw(
  File "/lib/python3.10/site-packages/torch/optim/adamw.py", line 335, in adamw
    func(
  File "/lib/python3.10/site-packages/torch/optim/adamw.py", line 412, in _single_tensor_adamw
    exp_avg.lerp_(grad, 1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and CPU!

I am replacing the line in the adam / adamw code with exp_avg.lerp_(grad.to(exp_avg.device), 1 - beta1): https://github.com/pytorch/pytorch/blob/34fe850d0083688abf0a27f3e864723f0858aab1/torch/optim/adam.py#L401

@conceptofmind
Author

conceptofmind commented Dec 17, 2023

I imported:
https://github.com/pytorch/pytorch/blob/main/torch/optim/adamw.py
https://github.com/pytorch/pytorch/blob/main/torch/optim/adam.py
https://github.com/pytorch/pytorch/blob/main/torch/optim/optimizer.py

And made your recommended adjustment above.

This does resolve the exp_avg.lerp_ device mismatch, but there is also a device mismatch error on: https://github.com/pytorch/pytorch/blob/34fe850d0083688abf0a27f3e864723f0858aab1/torch/optim/adam.py#L402

I moved the grad to the device in exp_avg_sq.mul_(beta2).addcmul_(grad.to(exp_avg_sq.device), grad.to(exp_avg_sq.device).conj(), value=1 - beta2) as well, and it seems to work alright now.
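
For reference, after both edits the two lines inside the copied _single_tensor_adam loop look roughly like this (the .to(...) calls are the only changes; everything around them is unchanged from the copied file):

# inside the copied _single_tensor_adam loop; only the .to(...) calls are added
exp_avg.lerp_(grad.to(exp_avg.device), 1 - beta1)
exp_avg_sq.mul_(beta2).addcmul_(grad.to(exp_avg_sq.device), grad.to(exp_avg_sq.device).conj(), value=1 - beta2)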

Maybe moving the grad to the device could be done in the for loop instead?

https://github.com/pytorch/pytorch/blob/34fe850d0083688abf0a27f3e864723f0858aab1/torch/optim/adam.py#L371

for i, param in enumerate(params):
    grad = grads[i] if not maximize else -grads[i]

    grad = grad.to(param.device)

    exp_avg = exp_avgs[i]
    exp_avg_sq = exp_avg_sqs[i]
    step_t = state_steps[i]

Both of these seem to work okay with some other minor adjustments to the Adam code (to prevent type-checking errors) and allow the use of gradient_as_bucket_view=True.

@lucidrains
Owner

@conceptofmind yup! you got it 💯

@conceptofmind
Author

conceptofmind commented Dec 18, 2023

> @conceptofmind yup! you got it 💯

Thank you for the pointer!

This may be a good place to utilize something like Mitchell Wortsman's StableAdamW, as it would require fewer alterations than importing the Adam files:
https://gist.github.com/mitchellnw/d42e22a0b9ec02ceaf4f7b4457f51423

I messaged Horace He to see if there is a reasonable way to get around needing find_unused_parameters=True to handle:

Parameter indices which did not receive grad for rank 1: 8 9 10 11 12 13 14 15 30 31 32 33 34 35 36 37 52 53 54 55 56 57 58 59 74 75 76 77 78 79 80 81

Parameter indices which did not receive grad for rank 0: 16 17 18 19 20 21 22 23 38 39 40 41 42 43 44 45 60 61 62 63 64 65 66 67 82 83 84 85 86 87 88 89

Going to see if hooks can potentially be used to deal with the backward pass.
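
One workaround we may compare against (just a sketch of the general idea, not something we have validated on this repo; the helper name and argument are ours): give every inactive expert a zero-valued contribution to the loss so each parameter participates in autograd, and DDP no longer needs find_unused_parameters=True. This assumes the inactive experts' parameters are on the same device as the loss.

def add_zero_contribution(loss, inactive_params):
    # multiplying by 0. leaves the loss value unchanged but still routes a (zero) gradient
    # to every inactive parameter, so DDP's reducer sees a grad for each one
    # usage (hypothetical): loss = add_zero_contribution(loss, params_of_inactive_experts)
    return loss + sum(p.sum() * 0. for p in inactive_params)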

There is one other slight issue we are investigating.

Accidentally closed. Oops!

@lucidrains
Owner

lucidrains commented Dec 18, 2023

@conceptofmind @tomaarsen @haileyschoelkopf do let me know if you end up seeing anything in your experiments, positive or negative

@conceptofmind
Author

> @conceptofmind @tomaarsen @haileyschoelkopf do let me know if you end up seeing anything in your experiments, positive or negative

Definitely will do! We are going to meet with Carlos Riquelme very soon to discuss Soft-MoE and collaborating together. I will keep you updated on that as well.

@tomaarsen

Certainly!

@conceptofmind
Author

conceptofmind commented Dec 20, 2023

Hi Phil,

@haileyschoelkopf, @AranKomat, and I plan to have a call with Carlos Riquelme about Soft-MoE/MoE on Friday at 11:30 am EST.

I wanted to extend the invitation to you as well if that is something of interest.

If not, I can just post an update on the discussion here as well.

Best,

Enrico

@lucidrains
Owner

@conceptofmind oh i won't be able to make it at that time

yea keep me posted 👍

@lucidrains
Owner

@conceptofmind @tomaarsen how did it go?

just came across this paper, so it seems like there is something to it https://arxiv.org/abs/2402.08609

@conceptofmind
Author

It is still going 😆.

We are continuing ablations.

RE: if you all ever get stuck, if you can get me ssh access to one machine with multiple GPUs, i can put in some hours into debugging too

I will likely get you some hardware soon to test a few things since we were hitting a few snags here and there.

@lucidrains
Owner

ah sounds good

no worries, was just curious

@conceptofmind
Author

We ran into a snag where StabilityAI rescinded the compute they had initially offered.

Fortunately, we were able to get compute through LAION instead and will be continuing experiments now.
