
error: NotImplementedError: Could not run 'aten::_amp_foreach_non_finite_check_and_unscale_' with arguments from the 'CPU' backend. #3

Open
daixiangzi opened this issue Dec 5, 2023 · 23 comments

@daixiangzi

when I use:

from torch.cuda.amp import GradScaler

amp = GradScaler(init_scale=512, growth_interval=100)
for img, label in train_iter:
    with torch.cuda.amp.autocast(True):
        pred = backbone(img)
        loss = criterion(pred, label)  # hypothetical: the original snippet uses `loss` without showing how it is computed
    amp.scale(loss).backward()
    amp.unscale_(opt)
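
For reference, a minimal self-contained sketch of the standard GradScaler loop (toy model and data, not the reporter's backbone; gradient clipping is only an assumed reason for calling unscale_ early). The point is that scaler.unscale_ is the call that dispatches aten::_amp_foreach_non_finite_check_and_unscale_ once per gradient device, which is where the reported error comes from:

    import torch
    from torch.cuda.amp import GradScaler, autocast

    model = torch.nn.Linear(16, 4).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    scaler = GradScaler(init_scale=512, growth_interval=100)

    for _ in range(10):
        img = torch.randn(8, 16, device='cuda')
        label = torch.randint(0, 4, (8,), device='cuda')
        with autocast():
            loss = torch.nn.functional.cross_entropy(model(img), label)
        scaler.scale(loss).backward()
        scaler.unscale_(opt)    # raises the reported error if any grad lives on CPU and the op has no CPU kernel
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(opt)        # skips the optimizer step if infs/NaNs were found
        scaler.update()
        opt.zero_grad()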

@lucidrains
Owner

lucidrains commented Dec 5, 2023

@daixiangzi that looks like an unrelated error that is specific to cpu with the amp.scale? could you retry on cuda?

@daixiangzi
Author

daixiangzi commented Dec 6, 2023

@daixiangzi that looks like an unrelated error that is specific to cpu with the amp.scale? could you retry on cuda?

When I use a single machine with multiple GPUs, I can run your code, but when I use multiple machines with multiple GPUs, the error appears.

@daixiangzi
Author

@daixiangzi that looks like an unrelated error that is specific to cpu with the amp.scale? could you retry on cuda?
I have always used it on cuda

@lucidrains
Owner

are you using fsdp?

@daixiangzi
Author

daixiangzi commented Dec 7, 2023

fsdp

no, I use DDP

@lucidrains
Owner

ah, hard for me to debug. i only have a single machine

@lucidrains
Owner

@daixiangzi could you try disabling mixed precision just for the MoE block?

with torch.cuda.amp.autocast(enabled = False):
    # ... your moe forward
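
For concreteness, a minimal sketch of how that might look in a pre-norm transformer block (the names self.softmoe and self.ln_3 follow the reporter's snippet further down; the explicit .float() cast is my own assumption, so the fp32 norm and MoE weights see fp32 activations even when the outer forward runs under autocast):

    import torch
    from torch import nn

    class BlockWithSoftMoE(nn.Module):
        # minimal sketch: only the MoE sublayer is shown, attention etc. elided
        def __init__(self, dim, softmoe):
            super().__init__()
            self.ln_3 = nn.LayerNorm(dim)
            self.softmoe = softmoe

        def forward(self, x):
            with torch.cuda.amp.autocast(enabled=False):
                # cast possibly-fp16 activations up, then run the MoE sublayer entirely in fp32
                moe_in = self.ln_3(x.float())
                x = x + self.softmoe(moe_in.permute(1, 0, 2)).permute(1, 0, 2)
            return x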

@lucidrains
Owner

finally seeing the limits of pytorch

in jax, properly working moe is just a few lines of code

@daixiangzi
Author

@daixiangzi could you try disabling mixed precision just for the MoE block?

with torch.cuda.amp.autocast(enabled = False):
    # ... your moe forward

ok

@daixiangzi
Author

@daixiangzi could you try disabling mixed precision just for the MoE block?

with torch.cuda.amp.autocast(enabled = False):
    # ... your moe forward

I tried it just now, but it does not work.

@daixiangzi
Author

daixiangzi commented Jan 4, 2024

define moe layer:

self.softmoe = SoftMoE(dim = 512, num_experts = 64, is_distributed = True, offload_unused_experts_to_cpu = True)

forward:

with torch.cuda.amp.autocast(enabled = False):
    x = x + self.softmoe(self.ln_3(x).permute(1, 0, 2)).permute(1, 0, 2)

@daixiangzi
Author

NotImplementedError: Could not run 'aten::_amp_foreach_non_finite_check_and_unscale_' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process.
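
One hedged hypothesis, since the layer above is constructed with offload_unused_experts_to_cpu = True: GradScaler.unscale_ groups gradients by device and runs aten::_amp_foreach_non_finite_check_and_unscale_ once per group, and that op has no CPU kernel in many PyTorch builds, so a single parameter whose gradient ends up on the CPU is enough to produce exactly this error. A small diagnostic (standard PyTorch only; `opt` is the optimizer from the training loop above) to check where the gradients the scaler would unscale actually live:

    from collections import Counter

    def grad_devices(optimizer):
        # count gradients per device across all param groups handled by the optimizer
        counts = Counter()
        for group in optimizer.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    counts[str(p.grad.device)] += 1
        return counts

    # call this right after backward() and before amp.unscale_(opt);
    # an output like Counter({'cuda:0': 160, 'cpu': 4}) would explain the error
    print(grad_devices(opt))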

@daixiangzi
Author

In fact I trained the DP version on a single machine with multiple GPUs, but the result is still bad, so I have some doubts about one detail.

I looked at the original repo: https://github.com/google-research/vmoe/blob/dfb9ee01ce6dfc5a8228b406e768ee325dd18fcd/vmoe/nn/vit_moe.py#L109C27-L109C45
Your code:

x = self.norm(x)
logits = einsum('b n d, e s d -> b n e s', x, slot_embeds)
...
slots = einsum('b n d, b n e s -> b e s d', x, dispatch_weights)

The original repo's code seems to be:

norm_x = self.norm(x)
logits = einsum('b n d, e s d -> b n e s', norm_x, slot_embeds)
....
slots = einsum('b n d, b n e s -> b e s d', x, dispatch_weights)
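
To make the difference concrete, a minimal sketch (plain torch.einsum; the softmax over the token axis is filled in from the reporter's own snippet further down, the shapes are assumptions) of the vmoe-style dispatch, where the normalized input is used only for the routing logits while the slots are built from the raw input:

    import torch

    def dispatch_slots(x, slot_embeds, norm):
        # x: (b, n, d) tokens, slot_embeds: (e, s, d) per-expert slot embeddings
        norm_x = norm(x)                                              # normalized copy, used only for routing
        logits = torch.einsum('b n d, e s d -> b n e s', norm_x, slot_embeds)
        dispatch_weights = logits.softmax(dim=1)                      # softmax over tokens, 'D' in the paper
        slots = torch.einsum('b n d, b n e s -> b e s d', x, dispatch_weights)  # raw x, as in the vmoe snippet
        return slots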

@lucidrains
Owner

@daixiangzi hmm, have you tried it their way? conventionally we always work with pre-normalized values, so it would be weird for them to dispatch based on the unnormalized input to the block

@daixiangzi
Author

@daixiangzi hmm, have you tried it their way? conventionally we always work with pre-normalized values, so it would be weird for them to dispatch based on the unnormalized input to the block

No, I will try it. But I still have some doubts about the effectiveness of SoftMoE, mainly because I have been tuning it for a long time.

@daixiangzi
Author

For visual tasks there are currently many open-source MoE implementations, but in fact there is not a sufficiently thorough experiment demonstrating the effectiveness of this approach.

@lucidrains
Owner

@daixiangzi yea it happens

thanks for sharing your results

just to make sure we don't miss anything, would you like to also do a quick run where i substituted the rmsnorm with layernorm? (set https://github.com/lucidrains/soft-moe-pytorch/blob/main/soft_moe_pytorch/soft_moe.py#L287 to True)

@daixiangzi
Author

@daixiangzi yea it happens

thanks for sharing your results

just to make sure we don't miss anything, would you like to also do a quick run where i substituted the rmsnorm with layernorm? (set https://github.com/lucidrains/soft-moe-pytorch/blob/main/soft_moe_pytorch/soft_moe.py#L287 to True)

Actually, before this, I tried my own version where the norm part used LN, but the effect was still not as good as the baseline

@lucidrains
Owner

@daixiangzi ah ok, good to know

i'll take your experience as a datapoint

@daixiangzi
Author

My own SoftMoE version:

        #x: [b,50,768]
        #phi [768, 128, slots]
        phi = self.phi
        phi = self.scale*self.normalize(phi,axis=0)
        #logits:[b, 50, 128, 1]
        
        logits = einsum(self.normalize(x,axis=-1), phi, "b m d, d n p -> b m n p")
        #add noise
        if self.training:
            normal = torch.randn(logits.shape, dtype=logits.dtype,device=logits.device)
            logits = logits +(1.0 * normal)
        #dispatch_weights:[b, token_num, 128, 1])
        dispatch_weights = logits.softmax(dim=1)  # denoted 'D' in the paper
        # NOTE: The 'torch.softmax' function does not support multiple values for the
        # 'dim' argument (unlike jax), so we are forced to flatten the last two dimensions.
        # Then, we rearrange the Tensor into its original shape.
        # softmax over experts: combine_weights: [b, token_num, 128, 1]
        combine_weights = rearrange(logits.flatten(start_dim=2).softmax(dim=-1),"b m (n p) -> b m n p",n=self.num_experts)

        # NOTE: To save memory, I don't rename the intermediate tensors Y, Ys, Xs.
        # Instead, I just overwrite the 'x' variable.  The names from the paper are
        # included in a comment for each line below.
        
        x = einsum(x, dispatch_weights, "b m d, b m n p -> b n p d")  # Xs
        #x:[b,128, 1, out_d]
        x = self.experts(x)  # Ys
        #combine_weights:[b, 50, 128, 1]
        x = einsum(x, combine_weights, "b n p d, b m n p -> b m d")  # Y
        #[b, 128, 1, 384]
        return x

@daixiangzi
Author

In the SoftMoE paper, self.normalize should be an l2 norm, for example:

 def normalize(self,x:torch.Tensor,axis:int):
        return x*torch.rsqrt(torch.square(x).sum(dim=axis, keepdim=True) + 1e-6)
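
As a quick sanity check (my addition, standard PyTorch only): this is essentially the same l2 normalization as torch.nn.functional.normalize, differing only in how the epsilon enters (added under the square root here, versus clamping the norm there):

    import torch
    import torch.nn.functional as F

    def normalize(x: torch.Tensor, axis: int):
        return x * torch.rsqrt(torch.square(x).sum(dim=axis, keepdim=True) + 1e-6)

    x = torch.randn(4, 512)
    a = normalize(x, axis=-1)
    b = F.normalize(x, dim=-1, eps=1e-6)     # divides by the norm clamped at eps
    print(torch.allclose(a, b, atol=1e-5))   # True for typical, non-degenerate inputs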

@lucidrains
Owner

In the SoftMoE paper, self.normalize should be an l2 norm, for example:

 def normalize(self,x:torch.Tensor,axis:int):
        return x*torch.rsqrt(torch.square(x).sum(dim=axis, keepdim=True) + 1e-6)

ohh i did it correct then the first time

@daixiangzi
Author

In the SoftMoE paper, self.normalize should be an l2 norm, for example:

 def normalize(self,x:torch.Tensor,axis:int):
        return x*torch.rsqrt(torch.square(x).sum(dim=axis, keepdim=True) + 1e-6)

ohh i did it correct then the first time

ok
