RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' #15

Open
aspen01 opened this issue Mar 14, 2024 · 15 comments

@aspen01

aspen01 commented Mar 14, 2024

To merge the LoRA checkpoint for the llama 2 7B model, I ran python merge_lora.py.

But an error occurred:

Traceback (most recent call last):
  File "/Users/xxx/llama/slowllama/merge_lora.py", line 14, in <module>
    add_lora(model_path, lora_path, out_model_path)
  File "/Users/xxx/llama/slowllama/loader.py", line 188, in add_lora
    lora = lora_weights[b_key].mm(lora_weights[a_key]) * lora_scale
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

So I modified the code as shown below, and I got the merged model file.

lora = lora_weights[b_key].to(torch.float32).mm(lora_weights[a_key].to(torch.float32)) * lora_scale

But I'm not sure whether this is okay or not.
Can you give your opinion or the right solution?
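
For reference, a minimal sketch of this cast-to-float32 workaround, assuming fp16 LoRA factors of shape (out_features, rank) and (rank, in_features); the function name, toy shapes, and out_dtype are illustrative and not slowllama's actual loader.py code.

import torch

def lora_delta_fp32(lora_b, lora_a, lora_scale, out_dtype=torch.bfloat16):
    # Compute the LoRA delta in float32 on CPU, then cast it back to the
    # dtype the frozen weight uses before adding it in.
    delta = lora_b.to(torch.float32).mm(lora_a.to(torch.float32)) * lora_scale
    return delta.to(out_dtype)

# Toy fp16 inputs, matching the dtypes from conf_fp16.py
b = torch.randn(8, 2).to(torch.float16)
a = torch.randn(2, 8).to(torch.float16)
print(lora_delta_fp32(b, a, lora_scale=2.0).shape)  # torch.Size([8, 8])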

@okuvshynov
Owner

Oh interesting, thank you. Let me take a look.

@okuvshynov
Owner

What types were you using for finetuning?
I think you'll need to double-check that the merged weights produce the same outputs as the non-merged ones (https://github.com/okuvshynov/slowllama?tab=readme-ov-file#merging-lora-weights-back).

@aspen01
Author

aspen01 commented Mar 15, 2024

I used the types from conf_fp16.py.

adamw_eps = 1e-4
compute_dtype = torch.float16
frozen_dtype = torch.float16

@okuvshynov
Owner

Got it. I think I'll need to try it myself to double-check (we transform the weights fp16 -> fp32 -> bf16), but if the merged model produces reasonable output, it should be ok.
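
For reference, a quick hedged check (not from slowllama) of how much the fp16 -> fp32 -> bf16 round trip mentioned above changes values; bf16 keeps only 7 mantissa bits vs fp16's 10, so some rounding is expected:

import torch

w_fp16 = torch.randn(1024, 1024).to(torch.float16)
w_bf16 = w_fp16.to(torch.float32).to(torch.bfloat16)

# Per-element difference introduced purely by the dtype round trip
diff = (w_bf16.to(torch.float32) - w_fp16.to(torch.float32)).abs()
print(diff.max().item(), diff.mean().item())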

@aspen01
Author

aspen01 commented Mar 16, 2024

Sometimes the merged model produces the expected results. But I don't know whether the unexpected results are due to
the merged weights or insufficient fine-tuning.

@okuvshynov
Owner

As I understand it, you are doing the fine-tuning on CPU? I'm not sure there's any benefit to using fp16 if the underlying architecture doesn't support it natively.

@aspen01
Author

aspen01 commented Mar 22, 2024

I'm testing fine-tuning on an Apple M1, and I know that it uses the GPU during fine-tuning.
I tried fine-tuning on CPU with llama.cpp, but slowllama takes less training time, so I want to try fine-tuning with slowllama.

@okuvshynov
Owner

Can you still reproduce this after our fix in #16? When I tried it at the time on an Apple M1, I didn't have to convert to fp32 and back.

@Nirjhor27

Actually, I tried it yesterday on an M2 Ultra and had the same issue; I had to do the float32 conversion, and that solved it.

@okuvshynov
Owner

@Nirjhor27 Interesting! Which torch version are you using? The error essentially says that fp16 matrix multiplication is not implemented on CPU. On my M1/M2 devices I can do it, though:

>>> import torch
>>> torch.__version__
'2.2.1'
>>> a = torch.rand(2, 2).to(torch.float16).to('cpu')
>>> b = torch.rand(2, 2).to(torch.float16).to('cpu')
>>> a.mm(b)
tensor([[0.3838, 1.0488],
        [0.0728, 0.4006]], dtype=torch.float16)

Does this snippet work for you?

@Nirjhor27

I am using 2.1.2.
And nope, running the snippet results in:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

I understand that fp16 is for GPU and not CPU, but I am also worried that doing the conversion as aspen01 suggested will mess up the weights when merging. I could merge after doing the float32 conversion, and the merged model appears to be working fine, but I have the same question as aspen01: is it actually okay or not?
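
For reference, one way to sanity-check that concern (a hedged sketch, not from slowllama; it assumes a setup where the fp16 matmul actually works, e.g. the mps device or a torch CPU build that supports it) is to compare the float32-upcast path against a native fp16 matmul:

import torch

device = 'mps' if torch.backends.mps.is_available() else 'cpu'
b = torch.randn(64, 16).to(torch.float16).to(device)
a = torch.randn(16, 64).to(torch.float16).to(device)

via_fp32 = b.to(torch.float32).mm(a.to(torch.float32))  # the workaround path
native = b.mm(a).to(torch.float32)                       # native fp16 matmul
# Expect only a small difference, on the order of fp16 rounding error
print((via_fp32 - native).abs().max().item())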

@okuvshynov
Owner

Interesting; maybe it has something to do with recent work in torch, e.g. pytorch/pytorch@2240018. I cannot test an older torch version right now, as I'd need to downgrade Python as well.

I'll make a change to detect whether the device supports fp16. Alternatively, we could run merge_lora on the mps device as well.
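
For reference, one possible shape for such a check (a hedged sketch under assumptions, not necessarily what the eventual fix does): probe whether this torch build supports fp16 matmul on CPU and fall back to float32 when it does not.

import torch

def cpu_supports_fp16_mm():
    # Older torch builds raise: "addmm_impl_cpu_" not implemented for 'Half'
    try:
        x = torch.zeros(1, 1, dtype=torch.float16)
        x.mm(x)
        return True
    except RuntimeError:
        return False

def safe_mm(b, a):
    # Upcast only when we have fp16 tensors on a CPU that cannot multiply them
    if b.device.type == 'cpu' and b.dtype == torch.float16 and not cpu_supports_fp16_mm():
        return b.to(torch.float32).mm(a.to(torch.float32))
    return b.mm(a)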

@Nirjhor27

Thanks, I'll keep an eye out and update if I find an alternative to the float32 conversion.

@okuvshynov
Owner

okuvshynov commented Mar 26, 2024

f055a88

I suspect the result might be a little different, but I'm not sure how big a difference it will make.

Btw, @Nirjhor27 - since you used an M2 Ultra, what was the GPU utilization when you tried to fine-tune? Thank you!

@Nirjhor27

I haven't checked it yet (I am using a remote client); however, I plan to check it very soon when I fine-tune again, and I will update you on that.
