
Instruction fine-tuning on a single GPU fails #34

Closed · 3 tasks done · dusens opened this issue May 9, 2024 · 5 comments

dusens commented May 9, 2024

Pre-submission checklist

  • Make sure you are using the latest code from this repository (git pull).
  • You have read the FAQ section of the project documentation and searched existing issues without finding a similar problem or solution.
  • For problems with third-party components such as llama.cpp or text-generation-webui, please look for a solution in the corresponding project first.

Issue type

None

Base model

Llama-3-Chinese-8B-Instruct (instruct model)

Operating system

None

Problem description

The training data looks like this:
 "instruction": "I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\"\n##Instruction:\ndepartment_management contains tables such as department, head, management. Table department has columns such as Department_ID, Name, Creation, Ranking, Budget_in_Billions, Num_Employees. Department_ID is the primary key.\nTable head has columns such as head_ID, name, born_state, age. head_ID is the primary key.\nTable management has columns such as department_ID, head_ID, temporary_acting. department_ID is the primary key.\nThe head_ID of management is the foreign key of head_ID of head.\nThe department_ID of management is the foreign key of Department_ID of department.\n\n",
        "input": "###Input:\nHow many heads of the departments are older than 56 ?\n\n###Response:",
        "output": "SELECT count(*) FROM head WHERE age  >  56"
    },
    {
        "instruction": "I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\"\n##Instruction:\ndepartment_management contains tables such as department, head, management. Table department has columns such as Department_ID, Name, Creation, Ranking, Budget_in_Billions, Num_Employees. Department_ID is the primary key.\nTable head has columns such as head_ID, name, born_state, age. head_ID is the primary key.\nTable management has columns such as department_ID, head_ID, temporary_acting. department_ID is the primary key.\nThe head_ID of management is the foreign key of head_ID of head.\nThe department_ID of management is the foreign key of Department_ID of department.\n\n",
        "input": "###Input:\nList the name, born state and age of the heads of departments ordered by age.\n\n###Response:",
        "output": "SELECT name ,  born_state ,  age FROM head ORDER BY age"
    },
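As a point of reference, the data above is an Alpaca-style JSON array in which every record carries instruction, input, and output string fields. A minimal loading and sanity-check sketch, assuming the records sit in a file named sql_sft.json (a placeholder name, not one used in this thread):

import json

# Sketch only: "sql_sft.json" is a placeholder for the dataset file excerpted above,
# assumed to be a JSON array of {"instruction", "input", "output"} records.
with open("sql_sft.json", encoding="utf-8") as f:
    records = json.load(f)

for rec in records:
    # Each Alpaca-style record should carry the three string fields used for SFT.
    assert isinstance(rec.get("instruction"), str)
    assert isinstance(rec.get("input"), str)
    assert isinstance(rec.get("output"), str)

print(f"{len(records)} records look well-formed")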

Dependencies (required for code-related issues)

# Please paste your dependency information here (inside this code block)

Run logs or screenshots

You are a helpful assistant. 你是一个乐于助人的助手。<|eot_id|><|start_header_id|>user<|end_header_id|>

I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.
"
##Instruction:
department_management contains tables such as department, head, management. Table department has columns such as Department_ID, Name, Creation, Ranking, Budget_in_Billions, Num_Employees. Department_ID is the primary key.
Table head has columns such as head_ID, name, born_state, age. head_ID is the primary key.
Table management has columns such as department_ID, head_ID, temporary_acting. department_ID is the primary key.
The head_ID of management is the foreign key of head_ID of head.
The department_ID of management is the foreign key of Department_ID of department.


###Input:
How many heads of the departments are older than 56?

###Response:<|eot_id|><|start_header_id|>assistant<|end_header_id|>

SELECT count(*) FROM head WHERE age  >  56
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/Chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 436, in <module>
[rank0]:     main()
[rank0]:   File "/data/Chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 309, in main
[rank0]:     logger.info(f"Evaluation files: {' '.join(files)}")
[rank0]: TypeError: sequence item 0: expected str instance, NoneType found
E0509 05:54:34.000000 140225384735808 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 513749) of binary: /root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-09_05:54:34
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 513749)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dusens (Author) commented May 9, 2024

These are the fine-tuning parameters:
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/data/models/llama-3-chinese-8b-instruct-v2
#pretrained_model=path/to/hf/meta-llama-3-8b/or/llama-3-chinese-8b/dir/or/model_id
#tokenizer_name_or_path=${pretrained_model}
tokenizer_name_or_path=/data/models/llama-3-chinese-8b-instruct-v2
#dataset_dir=path/to/sft/data/dir
dataset_dir=/data/Chinese-LLaMA-Alpaca-3/data
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
#output_dir=output_dir
output_dir=/data/models/llama-3-chinese-8b-instruct-v2-lora
validation_file=validation_file_name
#--validation_file ${validation_file} \

torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${tokenizer_name_or_path} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --low_cpu_mem_usage \
    --do_eval \
    --seed $RANDOM \
    --bf16 \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype bfloat16 \
    --load_in_kbits 16 \
    --ddp_find_unused_parameters False

iMountTai (Collaborator) commented

You set do_eval, but did not pass in a validation file.
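A minimal illustration of why the log ends in that TypeError (this is the failing pattern, not the repository code itself): with --do_eval set and no --validation_file, the list of evaluation files contains None, and str.join() only accepts strings:

# Illustrative sketch: reproduces the TypeError shown in the log above.
files = [None]  # hypothetical value of the evaluation file list when --validation_file is missing
print(f"Evaluation files: {' '.join(files)}")
# TypeError: sequence item 0: expected str instance, NoneType found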

dusens (Author) commented May 10, 2024

> You set do_eval, but did not pass in a validation file.

I removed it, but the same error still occurs.

Kikyo-chan commented May 16, 2024

I ran into a similar problem while fine-tuning llama-3-chinese-8b-instruct-v2 in a conda environment on a single Ubuntu 22.04 machine (L20 GPU with 48 GB of VRAM, 96 GB of RAM). I'm posting what I did below in the hope that it helps others:
1. Reproducing the problem:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 436, in <module>
[rank0]:     main()
[rank0]:   File "/root/chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 345, in main
[rank0]:     model = AutoModelForCausalLM.from_pretrained(
[rank0]:   File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
[rank0]:     return model_class.from_pretrained(
[rank0]:   File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3078, in from_pretrained
[rank0]:     raise ValueError("Passing along a device_map requires low_cpu_mem_usage=True")
[rank0]: ValueError: Passing along a device_map requires low_cpu_mem_usage=True
E0516 11:06:03.908000 140085417427584 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 4046) of binary: /root/miniconda3/envs/chinese_LLaMA_Alpaca_3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-16_11:06:03
  host      : autodl-container-d3af44b2e2-4a80e372
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4046)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

2. The fix:
Change low_cpu_mem_usage to True in scripts/training/run_clm_sft_with_peft.py.

3. Why change low_cpu_mem_usage to True?
The error means that passing a device_map requires setting low_cpu_mem_usage=True. In a distributed setting, and especially when device mapping is involved, the low_cpu_mem_usage flag helps keep memory use under control while loading large models.

Why does this happen?

  1. Memory optimization: low_cpu_mem_usage reduces the CPU memory consumed during model initialization. When it is True, the model's weights are only materialized in memory when they are actually needed. This matters for large models, which would otherwise occupy a lot of memory up front; lazy loading keeps the initial footprint small.
  2. Device mapping: device_map lets different parts of the model be placed on different devices (such as GPUs). This is useful for large-scale training and inference, but it requires memory-efficient loading.
  3. Library requirement: in some cases PyTorch, or a library built on it such as Hugging Face Transformers, requires memory-efficient loading before performing certain operations like device mapping. This is an internal safeguard for performance and stability.

Why does the change fix the problem?
With low_cpu_mem_usage set to True, the model is loaded with the memory-efficient strategy that device_map requires, so loading succeeds and memory usage stays manageable. A minimal sketch of such a call is shown below.
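For reference, a minimal sketch of the kind of call involved (the model path is taken from the script below; the exact arguments used inside run_clm_sft_with_peft.py may differ):

import torch
from transformers import AutoModelForCausalLM

# Sketch only: the path and device_map value are placeholders, not the script's exact arguments.
# Passing a device_map requires low_cpu_mem_usage=True, which defers materializing
# weights until each shard is placed on its target device.
model = AutoModelForCausalLM.from_pretrained(
    "/root/models/llama-3-chinese-8b-instruct-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",        # assign model parts to the available devices
    low_cpu_mem_usage=True,   # required whenever device_map is passed
)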

Here is my slightly modified fine-tuning script:
#!/bin/bash

# Basic configuration

lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

# Model and data paths

pretrained_model=/root/models/llama-3-chinese-8b-instruct-v2
tokenizer_name_or_path=${pretrained_model}
dataset_dir=/root/data
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
output_dir=/root/models/llama-3-chinese-8b-instruct-v2
validation_file=/root/XXXXXX.json

# Number of GPUs used for training

num_gpus=1
export CUDA_VISIBLE_DEVICES=0

# Launch training with torchrun

torchrun --nproc_per_node=$num_gpus run_clm_sft_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${tokenizer_name_or_path} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --bf16 \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.1 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype bfloat16 \
    --validation_file ${validation_file} \
    --load_in_kbits 16 \
    --ddp_find_unused_parameters False
########################################
Finally, a couple of training screenshots:
[training progress screenshots]
Training takes more than four hours; as of this reply it has not finished, so I can't yet say how well it works.

dusens (Author) commented May 21, 2024

Thank you, very detailed.

dusens closed this as completed May 21, 2024