
Instruction fine-tuning on a single GPU fails #34

Closed · 3 tasks done · dusens opened this issue May 9, 2024 · 5 comments

dusens commented May 9, 2024

Pre-submission checklist

  • Make sure you are using the latest code from this repository (git pull).
  • You have read the FAQ section of the project documentation and searched existing issues without finding a similar problem or solution.
  • For problems with third-party components such as llama.cpp or text-generation-webui, please look for a solution in the corresponding project first.

Issue type

None

Base model

Llama-3-Chinese-8B-Instruct (instruct model)

Operating system

None

Problem description

The training data looks like this:
 "instruction": "I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\"\n##Instruction:\ndepartment_management contains tables such as department, head, management. Table department has columns such as Department_ID, Name, Creation, Ranking, Budget_in_Billions, Num_Employees. Department_ID is the primary key.\nTable head has columns such as head_ID, name, born_state, age. head_ID is the primary key.\nTable management has columns such as department_ID, head_ID, temporary_acting. department_ID is the primary key.\nThe head_ID of management is the foreign key of head_ID of head.\nThe department_ID of management is the foreign key of Department_ID of department.\n\n",
        "input": "###Input:\nHow many heads of the departments are older than 56 ?\n\n###Response:",
        "output": "SELECT count(*) FROM head WHERE age  >  56"
    },
    {
        "instruction": "I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\"\n##Instruction:\ndepartment_management contains tables such as department, head, management. Table department has columns such as Department_ID, Name, Creation, Ranking, Budget_in_Billions, Num_Employees. Department_ID is the primary key.\nTable head has columns such as head_ID, name, born_state, age. head_ID is the primary key.\nTable management has columns such as department_ID, head_ID, temporary_acting. department_ID is the primary key.\nThe head_ID of management is the foreign key of head_ID of head.\nThe department_ID of management is the foreign key of Department_ID of department.\n\n",
        "input": "###Input:\nList the name, born state and age of the heads of departments ordered by age.\n\n###Response:",
        "output": "SELECT name ,  born_state ,  age FROM head ORDER BY age"
    },
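As a point of reference, the data above is an Alpaca-style JSON array in which every record carries instruction, input, and output string fields. A minimal loading and sanity-check sketch, assuming the records sit in a file named sql_sft.json (a placeholder name, not one used in this thread):

import json

# Sketch only: "sql_sft.json" is a placeholder for the dataset file excerpted above,
# assumed to be a JSON array of {"instruction", "input", "output"} records.
with open("sql_sft.json", encoding="utf-8") as f:
    records = json.load(f)

for rec in records:
    # Each Alpaca-style record should carry the three string fields used for SFT.
    assert isinstance(rec.get("instruction"), str)
    assert isinstance(rec.get("input"), str)
    assert isinstance(rec.get("output"), str)

print(f"{len(records)} records look well-formed")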

Dependencies (required for code-related issues)

# Please paste your dependency information here (inside this code block)

Run logs or screenshots

You are a helpful assistant. 你是一个乐于助人的助手。<|eot_id|><|start_header_id|>user<|end_header_id|>

I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.
"
##Instruction:
department_management contains tables such as department, head, management. Table department has columns such as Department_ID, Name, Creation, Ranking, Budget_in_Billions, Num_Employees. Department_ID is the primary key.
Table head has columns such as head_ID, name, born_state, age. head_ID is the primary key.
Table management has columns such as department_ID, head_ID, temporary_acting. department_ID is the primary key.
The head_ID of management is the foreign key of head_ID of head.
The department_ID of management is the foreign key of Department_ID of department.


###Input:
How many heads of the departments are older than 56?

###Response:<|eot_id|><|start_header_id|>assistant<|end_header_id|>

SELECT count(*) FROM head WHERE age  >  56
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/Chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 436, in <module>
[rank0]:     main()
[rank0]:   File "/data/Chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 309, in main
[rank0]:     logger.info(f"Evaluation files: {' '.join(files)}")
[rank0]: TypeError: sequence item 0: expected str instance, NoneType found
E0509 05:54:34.000000 140225384735808 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 513749) of binary: /root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/Chinese-LLaMA-Alpaca-3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-09_05:54:34
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 513749)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
dusens (Author) commented May 9, 2024

These are the fine-tuning parameters:
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/data/models/llama-3-chinese-8b-instruct-v2
#pretrained_model=path/to/hf/meta-llama-3-8b/or/llama-3-chinese-8b/dir/or/model_id
#tokenizer_name_or_path=${pretrained_model}
tokenizer_name_or_path=/data/models/llama-3-chinese-8b-instruct-v2
#dataset_dir=path/to/sft/data/dir
dataset_dir=/data/Chinese-LLaMA-Alpaca-3/data
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
#output_dir=output_dir
output_dir=/data/models/llama-3-chinese-8b-instruct-v2-lora
validation_file=validation_file_name
#--validation_file ${validation_file} \

torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${tokenizer_name_or_path} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --low_cpu_mem_usage \
    --do_eval \
    --seed $RANDOM \
    --bf16 \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype bfloat16 \
    --load_in_kbits 16 \
    --ddp_find_unused_parameters False

iMountTai (Collaborator) commented

You set do_eval, but did not pass in a validation file.
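A minimal illustration of why the log ends in that TypeError (this is the failing pattern, not the repository code itself): with --do_eval set and no --validation_file, the list of evaluation files contains None, and str.join() only accepts strings:

# Illustrative sketch: reproduces the TypeError shown in the log above.
files = [None]  # hypothetical value of the evaluation file list when --validation_file is missing
print(f"Evaluation files: {' '.join(files)}")
# TypeError: sequence item 0: expected str instance, NoneType found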

dusens (Author) commented May 10, 2024

> You set do_eval, but did not pass in a validation file.

I removed it, but the same error still occurs.

Kikyo-chan commented May 16, 2024

I ran into a similar problem while fine-tuning llama-3-chinese-8b-instruct-v2 in a conda environment on a single Ubuntu 22.04 machine (L20 GPU with 48 GB of VRAM, 96 GB of RAM). I'm posting what I did below in the hope that it helps others:
1. Reproducing the problem:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 436, in <module>
[rank0]:     main()
[rank0]:   File "/root/chinese-LLaMA-Alpaca-3/scripts/training/run_clm_sft_with_peft.py", line 345, in main
[rank0]:     model = AutoModelForCausalLM.from_pretrained(
[rank0]:   File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
[rank0]:     return model_class.from_pretrained(
[rank0]:   File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3078, in from_pretrained
[rank0]:     raise ValueError("Passing along a device_map requires low_cpu_mem_usage=True")
[rank0]: ValueError: Passing along a device_map requires low_cpu_mem_usage=True
E0516 11:06:03.908000 140085417427584 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 4046) of binary: /root/miniconda3/envs/chinese_LLaMA_Alpaca_3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/chinese_LLaMA_Alpaca_3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm_sft_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-16_11:06:03
  host      : autodl-container-d3af44b2e2-4a80e372
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4046)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

2. The fix:
Change low_cpu_mem_usage to True in scripts/training/run_clm_sft_with_peft.py.

3. Why change low_cpu_mem_usage to True?
The error means that passing a device_map requires setting low_cpu_mem_usage=True. In a distributed setting, and especially when device mapping is involved, the low_cpu_mem_usage flag helps keep memory use under control while loading large models.

Why does this happen?

  1. Memory optimization: low_cpu_mem_usage reduces the CPU memory consumed during model initialization. When it is True, the model's weights are only materialized in memory when they are actually needed. This matters for large models, which would otherwise occupy a lot of memory up front; lazy loading keeps the initial footprint small.
  2. Device mapping: device_map lets different parts of the model be placed on different devices (such as GPUs). This is useful for large-scale training and inference, but it requires memory-efficient loading.
  3. Library requirement: in some cases PyTorch, or a library built on it such as Hugging Face Transformers, requires memory-efficient loading before performing certain operations like device mapping. This is an internal safeguard for performance and stability.

Why does the change fix the problem?
With low_cpu_mem_usage set to True, the model is loaded with the memory-efficient strategy that device_map requires, so loading succeeds and memory usage stays manageable. A minimal sketch of such a call is shown below.
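For reference, a minimal sketch of the kind of call involved (the model path is taken from the script below; the exact arguments used inside run_clm_sft_with_peft.py may differ):

import torch
from transformers import AutoModelForCausalLM

# Sketch only: the path and device_map value are placeholders, not the script's exact arguments.
# Passing a device_map requires low_cpu_mem_usage=True, which defers materializing
# weights until each shard is placed on its target device.
model = AutoModelForCausalLM.from_pretrained(
    "/root/models/llama-3-chinese-8b-instruct-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",        # assign model parts to the available devices
    low_cpu_mem_usage=True,   # required whenever device_map is passed
)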

Here is my slightly modified fine-tuning script:
#!/bin/bash

# Basic configuration

lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

# Model and data paths

pretrained_model=/root/models/llama-3-chinese-8b-instruct-v2
tokenizer_name_or_path=${pretrained_model}
dataset_dir=/root/data
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=512
output_dir=/root/models/llama-3-chinese-8b-instruct-v2
validation_file=/root/XXXXXX.json

# Number of GPUs used for training

num_gpus=1
export CUDA_VISIBLE_DEVICES=0

# Launch training with torchrun

torchrun --nproc_per_node=$num_gpus run_clm_sft_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${tokenizer_name_or_path} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --bf16 \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.1 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype bfloat16 \
    --validation_file ${validation_file} \
    --load_in_kbits 16 \
    --ddp_find_unused_parameters False
########################################
Finally, a couple of training screenshots:
[training progress screenshots]
Training takes more than four hours; as of this reply it has not finished, so I can't yet say how well it works.

dusens (Author) commented May 21, 2024

Thank you, very detailed.

dusens closed this as completed May 21, 2024