setStorage out of bounds for size 0, on 2xV100 with accelerate #409

Open · kno10 opened this issue May 1, 2024 · 1 comment

@kno10 commented May 1, 2024

Trying to run Unsloth via LLaMA-Factory on two V100s with CUDA 12.3 and accelerate, I get the following error in matmul_lora:
RuntimeError: setStorage: sizes [4096, 8], strides [1, 4096], storage offset 0, and itemsize 4 requiring a storage size of 131072 are out of bounds for storage of size 0

Traceback (most recent call last):
  File "LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "LLaMA-Factory/src/train_bash.py", line 5, in main  
    run_exp()
  File "LLaMA-Factory/src/llmtuner/train/tuner.py", line 31, in run_exp
    run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
  File "LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 47, in run_pt
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "conda/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "<string>", line 361, in _fast_inner_training_loop
  File "conda/lib/python3.10/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "conda/lib/python3.10/site-packages/transformers/trainer.py", line 3161, in compute_loss
    outputs = model(**inputs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 857, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/llama.py", line 882, in PeftModelForCausalLM_fast_forward
    return self.base_model(
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/mistral.py", line 213, in MistralForCausalLM_fast_forward
    outputs = self.model(
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/llama.py", line 650, in LlamaModel_fast_forward
    hidden_states = Unsloth_Offloaded_Gradient_Checkpointer.apply(
  File "conda/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 115, in decorate_fwd
    return fwd(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/_utils.py", line 333, in forward
    (output,) = forward_function(hidden_states, *args)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 857, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/llama.py", line 433, in LlamaDecoderLayer_fast_forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/mistral.py", line 69, in MistralAttention_fast_forward
    Q, K, V = self.apply_qkv(self, hidden_states)
  File "conda/lib/python3.10/site-packages/unsloth/kernels/fast_lora.py", line 312, in apply_lora_qkv
    Q, K, V = LoRA_QKV.apply(X,
  File "conda/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 115, in decorate_fwd
    return fwd(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/kernels/fast_lora.py", line 227, in forward
    Q = matmul_lora(X, QW, QW_quant, QA, QB, QS)
  File "conda/lib/python3.10/site-packages/unsloth/kernels/utils.py", line 240, in matmul_lora
    A, B = A.t(), B.t()
RuntimeError: setStorage: sizes [4096, 8], strides [1, 4096], storage offset 0, and itemsize 4 requiring a storage size of 131072 are out of bounds for storage of size 0
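
For what it's worth, the failing shapes reproduce the error outside of training. Below is a hypothetical standalone sketch (not Unsloth code), assuming my reading is right that FSDP reshards a parameter by resizing its unsharded storage to zero bytes, so the later A.t() view in matmul_lora fails the storage bounds check; the exact message may vary by PyTorch version:

import torch

# Hypothetical reproduction: a float32 lora_A weight with the shapes from
# the traceback, whose storage has been freed the way FSDP frees an
# unsharded flat parameter after resharding.
a = torch.randn(8, 4096)            # lora_A: r=8, in_features=4096
a.untyped_storage().resize_(0)      # storage is now 0 bytes; sizes are kept
a.as_strided((4096, 8), (1, 4096))  # the transpose expressed as a strided view
# RuntimeError: setStorage: sizes [4096, 8], strides [1, 4096], storage
# offset 0, and itemsize 4 requiring a storage size of 131072 are out of
# bounds for storage of size 0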

I have recreated the conda environment using the instructions on the front page. If I disable Unsloth, LLaMA-Factory works.

My best guess is that the entire model cannot fit on one GPU for training: I have extended the vocabulary, so I have to fine-tune the embedding layers as well, not just a standard LoRA or even QLoRA. I used DeepSpeed without Unsloth on a first data subset, but I would expect Unsloth to be much faster, and I would like to use it.
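
For context, the extended-vocabulary setup looks roughly like this (a hedged sketch using standard transformers/peft APIs; the model name and added tokens are placeholders, not my actual config):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder; the traceback shows a Mistral-family model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Extend the vocabulary and grow the embedding matrix to match.
tokenizer.add_tokens(["<new-token-1>", "<new-token-2>"])  # placeholder tokens
model.resize_token_embeddings(len(tokenizer))

# Because the embeddings changed shape, they must be trained alongside the
# LoRA adapters, so they go into modules_to_save rather than target_modules.
config = LoraConfig(
    r=8,  # matches the rank visible in the traceback shapes
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, config)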

@danielhanchen (Contributor) commented

Hmmm, sadly multi-GPU issues are not a top priority, since Unsloth's mission is to be the best single-GPU library. I'll see what I can do, but can't promise anything - sorry!
