RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' #15

Open
aspen01 opened this issue Mar 14, 2024 · 15 comments

@aspen01

aspen01 commented Mar 14, 2024

To merge the LoRA checkpoint for the llama 2 7B model, I ran python merge_lora.py.

But an error occurred:

Traceback (most recent call last):
  File "/Users/xxx/llama/slowllama/merge_lora.py", line 14, in <module>
    add_lora(model_path, lora_path, out_model_path)
  File "/Users/xxx/llama/slowllama/loader.py", line 188, in add_lora
    lora = lora_weights[b_key].mm(lora_weights[a_key]) * lora_scale
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

So I modified the code as shown below, and I got the merged model file.

lora = lora_weights[b_key].to(torch.float32).mm(lora_weights[a_key].to(torch.float32)) * lora_scale

But I'm not sure whether this is okay or not.
Can you give your opinion or the right solution?
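
For reference, a minimal sketch of this cast-to-float32 workaround, assuming fp16 LoRA factors of shape (out_features, rank) and (rank, in_features); the function name, toy shapes, and out_dtype are illustrative and not slowllama's actual loader.py code.

import torch

def lora_delta_fp32(lora_b, lora_a, lora_scale, out_dtype=torch.bfloat16):
    # Compute the LoRA delta in float32 on CPU, then cast it back to the
    # dtype the frozen weight uses before adding it in.
    delta = lora_b.to(torch.float32).mm(lora_a.to(torch.float32)) * lora_scale
    return delta.to(out_dtype)

# Toy fp16 inputs, matching the dtypes from conf_fp16.py
b = torch.randn(8, 2).to(torch.float16)
a = torch.randn(2, 8).to(torch.float16)
print(lora_delta_fp32(b, a, lora_scale=2.0).shape)  # torch.Size([8, 8])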

@okuvshynov
Owner

Oh interesting, thank you. Let me take a look.

@okuvshynov
Owner

What types were you using for finetuning?
I think you'll need to double-check that the merged weights produce the same outputs as the non-merged ones (https://github.com/okuvshynov/slowllama?tab=readme-ov-file#merging-lora-weights-back).

@aspen01
Author

aspen01 commented Mar 15, 2024

I used the types from conf_fp16.py.

adamw_eps = 1e-4
compute_dtype = torch.float16
frozen_dtype = torch.float16

@okuvshynov
Owner

Got it. I think I'll need to try it myself to double-check (we transform the weights fp16 -> fp32 -> bf16), but if the merged model produces reasonable output, it should be ok.
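
For reference, a quick hedged check (not from slowllama) of how much the fp16 -> fp32 -> bf16 round trip mentioned above changes values; bf16 keeps only 7 mantissa bits vs fp16's 10, so some rounding is expected:

import torch

w_fp16 = torch.randn(1024, 1024).to(torch.float16)
w_bf16 = w_fp16.to(torch.float32).to(torch.bfloat16)

# Per-element difference introduced purely by the dtype round trip
diff = (w_bf16.to(torch.float32) - w_fp16.to(torch.float32)).abs()
print(diff.max().item(), diff.mean().item())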

@aspen01
Author

aspen01 commented Mar 16, 2024

Sometimes the merged model produces the expected results. But I don't know whether the unexpected results are due to
the merged weights or insufficient fine-tuning.

@okuvshynov
Owner

As I understand it, you are doing the fine-tuning on CPU? I'm not sure there's any benefit to using fp16 if the underlying architecture doesn't support it natively.

@aspen01
Author

aspen01 commented Mar 22, 2024

I'm testing fine-tuning on an Apple M1, and I know that it uses the GPU during fine-tuning.
I tried fine-tuning on CPU with llama.cpp, but slowllama takes less training time, so I want to try fine-tuning with slowllama.

@okuvshynov
Owner

Can you still reproduce this after our fix in #16? When I tried it at the time on an Apple M1, I didn't have to convert to fp32 and back.

@Nirjhor27

Actually, I tried it yesterday on an M2 Ultra and had the same issue; I had to do the float32 conversion, and that solved it.

@okuvshynov
Owner

@Nirjhor27 Interesting! Which torch version are you using? The error essentially says that fp16 matrix multiplication is not implemented on CPU. On my M1/M2 devices I can do it, though:

>>> import torch
>>> torch.__version__
'2.2.1'
>>> a = torch.rand(2, 2).to(torch.float16).to('cpu')
>>> b = torch.rand(2, 2).to(torch.float16).to('cpu')
>>> a.mm(b)
tensor([[0.3838, 1.0488],
        [0.0728, 0.4006]], dtype=torch.float16)

Does this snippet work for you?

@Nirjhor27

I am using 2.1.2.
And nope, running the snippet results in:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

I understand that fp16 is for GPU and not CPU, but I am also worried that doing the conversion as aspen01 suggested will mess up the weights when merging. I could merge after doing the float32 conversion, and the merged model appears to be working fine, but I have the same question as aspen01: is it actually okay or not?
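
For reference, one way to sanity-check that concern (a hedged sketch, not from slowllama; it assumes a setup where the fp16 matmul actually works, e.g. the mps device or a torch CPU build that supports it) is to compare the float32-upcast path against a native fp16 matmul:

import torch

device = 'mps' if torch.backends.mps.is_available() else 'cpu'
b = torch.randn(64, 16).to(torch.float16).to(device)
a = torch.randn(16, 64).to(torch.float16).to(device)

via_fp32 = b.to(torch.float32).mm(a.to(torch.float32))  # the workaround path
native = b.mm(a).to(torch.float32)                       # native fp16 matmul
# Expect only a small difference, on the order of fp16 rounding error
print((via_fp32 - native).abs().max().item())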

@okuvshynov
Owner

Interesting; maybe it has something to do with recent work in torch, e.g. pytorch/pytorch@2240018. I cannot test an older torch version right now, as I'd need to downgrade Python as well.

I'll make a change to detect whether the device supports fp16. Alternatively, we could run merge_lora on the mps device as well.
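
For reference, one possible shape for such a check (a hedged sketch under assumptions, not necessarily what the eventual fix does): probe whether this torch build supports fp16 matmul on CPU and fall back to float32 when it does not.

import torch

def cpu_supports_fp16_mm():
    # Older torch builds raise: "addmm_impl_cpu_" not implemented for 'Half'
    try:
        x = torch.zeros(1, 1, dtype=torch.float16)
        x.mm(x)
        return True
    except RuntimeError:
        return False

def safe_mm(b, a):
    # Upcast only when we have fp16 tensors on a CPU that cannot multiply them
    if b.device.type == 'cpu' and b.dtype == torch.float16 and not cpu_supports_fp16_mm():
        return b.to(torch.float32).mm(a.to(torch.float32))
    return b.mm(a)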

@Nirjhor27

Thanks, I'll keep an eye out and update if I find an alternative to the float32 conversion.

@okuvshynov
Owner

okuvshynov commented Mar 26, 2024

f055a88

I suspect the result might be a little different, but I'm not sure how big a difference it will make.

Btw, @Nirjhor27 - since you used an M2 Ultra, what was the GPU utilization when you tried to fine-tune? Thank you!

@Nirjhor27

I haven't checked it yet (I am using a remote client); however, I plan to check it very soon when I fine-tune again, and I will update you on that.
