
Fine-tuning in Colab fails with CUDA out of memory #44

Open
3 tasks done
chenmonster opened this issue May 16, 2024 · 9 comments

Comments

@chenmonster

Required checks before submitting

  • Make sure you are using the latest code from the repository (git pull).
  • I have read the FAQ section of the project documentation and searched the existing issues; no similar problem or solution was found.
  • For third-party tool issues (e.g., llama.cpp, text-generation-webui), please look for solutions in the corresponding project first.

Issue type

Model training and fine-tuning

Base model

Llama-3-Chinese-8B-Instruct (instruction model)

Operating system

Linux

Detailed description of the problem

lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=hfl/llama-3-chinese-8b-instruct-v2
tokenizer_name_or_path=${pretrained_model}
dataset_dir=./datasets--kigner--ruozhiba-llama3-tt/snapshots/2400d68db1bed109395e7470a6d9910581b21200
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
output_dir=output_dir
validation_file=validation_file_name

torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${tokenizer_name_or_path} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --low_cpu_mem_usage \
    --seed $RANDOM \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float16 \
    --load_in_kbits 4 \
    --ddp_find_unused_parameters False

Dependencies (required for code-related issues)

bitsandbytes                     0.43.1
peft                             0.7.1
sentencepiece                    0.1.99
torch                            2.2.1+cu121
torchaudio                       2.2.1+cu121
torchdata                        0.7.1
torchsummary                     1.5.1
torchtext                        0.17.1
torchvision                      0.17.1+cu121
transformers                     4.40.2

Run log or screenshot

[screenshot attached: 运行报错.png]

[INFO|modeling_utils.py:4178] 2024-05-16 03:58:24,735 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at hfl/llama-3-chinese-8b-instruct-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:883] 2024-05-16 03:58:24,835 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--hfl--llama-3-chinese-8b-instruct-v2/snapshots/15cfcd776b55047b601bf6635052f059ca754ded/generation_config.json
[INFO|configuration_utils.py:928] 2024-05-16 03:58:24,835 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128009
  ],
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}

05/16/2024 03:58:25 - INFO - __main__ - Model vocab size: 128256
05/16/2024 03:58:25 - INFO - __main__ - len(tokenizer):128256
05/16/2024 03:58:25 - INFO - __main__ - Init new peft model
05/16/2024 03:58:25 - INFO - __main__ - target_modules: ['q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj']
05/16/2024 03:58:25 - INFO - __main__ - lora_rank: 64
Traceback (most recent call last):
  File "/content/run_clm_sft_with_peft.py", line 439, in <module>
    main()
  File "/content/run_clm_sft_with_peft.py", line 391, in main
    model = get_peft_model(model, peft_config)
  File "/usr/local/lib/python3.10/dist-packages/peft/mapping.py", line 133, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 1043, in __init__
    super().__init__(model, peft_config, adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 126, in __init__
    self.set_additional_trainable_modules(peft_config, adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 631, in set_additional_trainable_modules
    _set_trainable(self, adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 276, in _set_trainable
    target.update(adapter_name)
  File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 190, in update
    self.modules_to_save.update(torch.nn.ModuleDict({adapter_name: copy.deepcopy(self.original_module)}))
  File "/usr/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.10/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.10/copy.py", line 297, in _reconstruct
    value = deepcopy(value, memo)
  File "/usr/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parameter.py", line 59, in __deepcopy__
    result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 14.75 GiB of which 695.06 MiB is free. Process 153578 has 14.07 GiB memory in use. Of the allocated memory 13.89 GiB is allocated by PyTorch, and 64.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-05-16 03:58:32,295] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 12197) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-16_03:58:32
  host      : 9af8a5d71495
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 12197)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
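
For reference, the allocator hint printed in the OOM message above can be tried before relaunching; this is only a sketch, since it mitigates fragmentation but does not reduce the total memory the run needs, so it may not be sufficient on its own.

# Apply the allocator setting suggested in the error message, then rerun the
# torchrun command from the description unchanged.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
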
@ymcui
Owner

ymcui commented May 16, 2024

Which GPU ran out of memory?

@chenmonster
Author

Which GPU ran out of memory?

Tesla T4

@ymcui
Owner

ymcui commented May 16, 2024

Your launch script sets modules_to_save="embed_tokens,lm_head"; these two modules are not trained with LoRA.
Consider setting it to None and see whether training can start.
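
A minimal sketch of the suggested change, assuming run_clm_sft_with_peft.py treats an omitted --modules_to_save as None (argument names taken from the launch script above):

# Train only the LoRA adapters: delete these two lines from the launch script,
#     modules_to_save="embed_tokens,lm_head"
#     --modules_to_save ${modules_to_save} \
# and leave every other argument unchanged.
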

@chenmonster
Author

Your launch script sets modules_to_save="embed_tokens,lm_head"; these two modules are not trained with LoRA. Consider setting it to None and see whether training can start.

After removing this parameter I still get the same error.

@ymcui
Owner

ymcui commented May 17, 2024

Did you restart the runtime? Make sure the GPU memory is cleared before running again.
Yesterday it ran fine on a Colab T4 (with modules_to_save=None), so please double-check on your end.
Alternatively, any other fine-tuning tool that supports Llama-3 training will also work.
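
A quick way to confirm the card is actually empty before relaunching (a stale process from a crashed run can hold most of the T4's ~14.75 GiB, as seen in the log above); nvidia-smi is assumed to be available on the Colab VM:

# Report current GPU memory usage; it should be close to 0 MiB before training starts.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# On Colab, "Runtime -> Restart runtime" (or killing the stale python process) frees the memory.
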

@chenmonster
Author

On AutoDL (an AI compute cloud), training runs fine with a V100-32GB GPU.
After training, how do I convert the model to GGUF format?
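
A hedged sketch of one common route to GGUF; the script and binary names below are assumptions that depend on the llama.cpp version, and the LoRA adapter must first be merged back into the base model (for example with the merge script provided in this repository, or peft's merge_and_unload()):

# Build llama.cpp and install its conversion requirements.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make && pip install -r requirements.txt
# /path/to/merged-model is a placeholder for the directory with the merged HF-format weights.
python convert-hf-to-gguf.py /path/to/merged-model --outfile llama3-zh-f16.gguf --outtype f16
# Optionally quantize the f16 GGUF to a smaller format such as Q4_K_M.
./quantize llama3-zh-f16.gguf llama3-zh-Q4_K_M.gguf Q4_K_M
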

@chenmonster
Author

When I change the default system prompt DEFAULT_SYSTEM_PROMPT and make it longer, training also fails with a CUDA out of memory error.

@alannesta

Can --load_in_kbits 8 be used on a Colab T4, or will it run out of memory?
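
A rough, hedged estimate of the weight memory alone (LoRA adapters, gradients, optimizer state and activations add several more GiB on top), assuming an 8B-parameter base model:

# Back-of-the-envelope weight memory for an 8B-parameter model:
#   8-bit: 8e9 params x 1.0 byte ~ 8 GiB
#   4-bit: 8e9 params x 0.5 byte ~ 4 GiB
# The T4 in the log exposes ~14.75 GiB, so 8-bit leaves far less headroom than 4-bit.
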
