setStorage out of bounds for size 0, on 2xV100 with accelerate #409

Open · kno10 opened this issue May 1, 2024 · 1 comment

@kno10 commented May 1, 2024

Trying to run Unsloth via LLaMA-Factory on two V100s with CUDA 12.3 and accelerate, I get the following error in matmul_lora:
RuntimeError: setStorage: sizes [4096, 8], strides [1, 4096], storage offset 0, and itemsize 4 requiring a storage size of 131072 are out of bounds for storage of size 0

Traceback (most recent call last):
  File "LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "LLaMA-Factory/src/train_bash.py", line 5, in main  
    run_exp()
  File "LLaMA-Factory/src/llmtuner/train/tuner.py", line 31, in run_exp
    run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
  File "LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 47, in run_pt
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "conda/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "<string>", line 361, in _fast_inner_training_loop
  File "conda/lib/python3.10/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "conda/lib/python3.10/site-packages/transformers/trainer.py", line 3161, in compute_loss
    outputs = model(**inputs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 857, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/llama.py", line 882, in PeftModelForCausalLM_fast_forward
    return self.base_model(
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/mistral.py", line 213, in MistralForCausalLM_fast_forward
    outputs = self.model(
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/llama.py", line 650, in LlamaModel_fast_forward
    hidden_states = Unsloth_Offloaded_Gradient_Checkpointer.apply(
  File "conda/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 115, in decorate_fwd
    return fwd(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/_utils.py", line 333, in forward
    (output,) = forward_function(hidden_states, *args)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 857, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/llama.py", line 433, in LlamaDecoderLayer_fast_forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/models/mistral.py", line 69, in MistralAttention_fast_forward
    Q, K, V = self.apply_qkv(self, hidden_states)
  File "conda/lib/python3.10/site-packages/unsloth/kernels/fast_lora.py", line 312, in apply_lora_qkv
    Q, K, V = LoRA_QKV.apply(X,
  File "conda/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 115, in decorate_fwd
    return fwd(*args, **kwargs)
  File "conda/lib/python3.10/site-packages/unsloth/kernels/fast_lora.py", line 227, in forward
    Q = matmul_lora(X, QW, QW_quant, QA, QB, QS)
  File "conda/lib/python3.10/site-packages/unsloth/kernels/utils.py", line 240, in matmul_lora
    A, B = A.t(), B.t()
RuntimeError: setStorage: sizes [4096, 8], strides [1, 4096], storage offset 0, and itemsize 4 requiring a storage size of 131072 are out of bounds for storage of size 0
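
For what it's worth, the failing shapes reproduce the error outside of training. Below is a hypothetical standalone sketch (not Unsloth code), assuming my reading is right that FSDP reshards a parameter by resizing its unsharded storage to zero bytes, so the later A.t() view in matmul_lora fails the storage bounds check; the exact message may vary by PyTorch version:

import torch

# Hypothetical reproduction: a float32 lora_A weight with the shapes from
# the traceback, whose storage has been freed the way FSDP frees an
# unsharded flat parameter after resharding.
a = torch.randn(8, 4096)            # lora_A: r=8, in_features=4096
a.untyped_storage().resize_(0)      # storage is now 0 bytes; sizes are kept
a.as_strided((4096, 8), (1, 4096))  # the transpose expressed as a strided view
# RuntimeError: setStorage: sizes [4096, 8], strides [1, 4096], storage
# offset 0, and itemsize 4 requiring a storage size of 131072 are out of
# bounds for storage of size 0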

I have recreated the conda environment using the instructions on the front page. If I disable Unsloth, LLaMA-Factory works.

My best guess is that the entire model cannot fit on one GPU for training: I have extended the vocabulary, so I have to fine-tune the embedding layers as well, not just a standard LoRA or even QLoRA. I used DeepSpeed without Unsloth on a first data subset, but I would expect Unsloth to be much faster, and I would like to use it.
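
For context, the extended-vocabulary setup looks roughly like this (a hedged sketch using standard transformers/peft APIs; the model name and added tokens are placeholders, not my actual config):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder; the traceback shows a Mistral-family model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Extend the vocabulary and grow the embedding matrix to match.
tokenizer.add_tokens(["<new-token-1>", "<new-token-2>"])  # placeholder tokens
model.resize_token_embeddings(len(tokenizer))

# Because the embeddings changed shape, they must be trained alongside the
# LoRA adapters, so they go into modules_to_save rather than target_modules.
config = LoraConfig(
    r=8,  # matches the rank visible in the traceback shapes
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, config)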

@danielhanchen (Contributor) commented

Hmmm, sadly multi-GPU issues are not a top priority, since Unsloth's mission is to be the best single-GPU library. I'll see what I can do, but can't promise anything - sorry!
